Update PrePush Modified Files and Lines with Multiple Branches Push #548
Conversation
sds left a comment:
Hey @taufek, thanks for your continued iteration here.
Overall looks fine, but I did have a concern about performance which I'm hoping you can answer. Thanks!
```diff
- "#{pushed_refs[0].remote_sha1}..#{pushed_refs[0].local_sha1}"
+ @modified_lines[file] = pushed_refs.map do |pushed_ref|
+   pushed_ref.modified_lines_in_file(file)
+ end.to_set.flatten
```
Can you help me understand why you are using to_set.flatten here?
modified_lines_in_file returns a Set, which you map over to produce an array of Sets that you then call to_set on, followed by flatten. This can be expensive for a large number of files and/or modified lines.
Consider the following benchmark:
```ruby
require 'benchmark'
require 'set'

iterations = 10
sets = 100.times.map do
  Array.new(1_000) { rand(0..1_000) }.to_set
end

Benchmark.bmbm do |x|
  x.report('.to_set.flatten') do
    iterations.times do
      sets.to_set.flatten
    end
  end

  x.report('each_with_object{ merge }') do
    iterations.times do
      sets.each_with_object(Set.new) { |item, set| set.merge(item) }
    end
  end
end

puts 'Are they equivalent?'
puts sets.to_set.flatten == sets.each_with_object(Set.new) { |item, set| set.merge(item) }
```

```
Rehearsal -------------------------------------------------------------
.to_set.flatten             0.340000   0.000000   0.340000 (  0.344824)
each_with_object{ merge }   0.020000   0.000000   0.020000 (  0.016745)
---------------------------------------------------- total: 0.360000sec

                                user     system      total        real
.to_set.flatten             0.360000   0.000000   0.360000 (  0.361935)
each_with_object{ merge }   0.020000   0.000000   0.020000 (  0.016793)

Are they equivalent?
true
```
In a nutshell, it would be significantly faster for large sets of files/lines if we didn't use to_set.flatten, but I want to make sure I understand your reasoning here in case I missed something.
flatten is called because I assume the expected output is a non-nested Set for modified_lines_in_file and a non-nested Array for modified_files.
For example, modified_lines_in_file with 2 Sets on PushedRef (when I pushed 2 branches) might output [<Set(1,2)>, <Set(3,4)>]. Calling to_set.flatten then converts it to <Set(1,2,3,4)>.
I'll definitely go with your suggestion if the output is the same, since it has better performance.
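For reference, a minimal sketch of the merge-based alternative discussed above, using hypothetical stand-in data for the per-ref line Sets (the `pushed_ref_line_sets` name is illustrative, not from the codebase):

```ruby
require 'set'

# Hypothetical stand-in for what each PushedRef#modified_lines_in_file
# call would return: one Set of modified line numbers per pushed ref.
pushed_ref_line_sets = [Set[1, 2], Set[3, 4], Set[2, 5]]

# Merge every ref's Set into a single flat Set, without building the
# intermediate Set-of-Sets that .to_set.flatten creates.
merged = pushed_ref_line_sets.each_with_object(Set.new) do |lines, set|
  set.merge(lines)
end
# merged == Set[1, 2, 3, 4, 5]
```

Because `Set#merge` mutates the accumulator in place, no intermediate collections are allocated beyond the final Set.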
```diff
  def modified_files
-   @modified_files ||= Overcommit::GitRepo.modified_files(refs: ref_range)
+   @modified_files ||= pushed_refs.map(&:modified_files).flatten
```
See my comments below for how you may want to speed this up for large ref ranges containing many large file sets.
```ruby
def modified_files
```
@sds,
I want to ask your opinion on this one. If I push changes to the same file on separate branches/local refs:

1. Should we convert the output to a unique list of files?
   `['file_one']`
2. Should we distinguish the file from different local refs by appending some sort of prefix? For example:
   `['<local_ref_1>:file_one', '<local_ref_2>:file_one']`
3. Or do nothing, and let there be duplicates of the same file from different local refs?
   `['file_one', 'file_one']`
Good question, @taufek. If I try to put myself in the shoes of a developer writing a pre-push hook, the least-surprising behavior to me would be the net set of differences between the local and remote branches, so the unique list of files (option 1) seems reasonable to me.
If you have ideas about why the other options would be better, let me know!
Yup, a unique list of files makes sense to me. Thanks.
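Option 1 can be sketched like this, with hypothetical per-ref file lists (the names are illustrative, not from the codebase):

```ruby
# Hypothetical modified-file lists from two pushed refs that both
# touched file_one.
files_per_ref = [%w[file_one file_two], %w[file_one file_three]]

# Flatten the per-ref lists and drop duplicates, yielding the unique
# list of files across all pushed refs (option 1 above).
unique_files = files_per_ref.flatten.uniq
# unique_files == ["file_one", "file_two", "file_three"]
```

`Array#uniq` preserves first-occurrence order, so files appear in the order they were first seen across the pushed refs.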
Force-pushed from e645f1e to 97fa56c.
@sds, it's ready for re-review.
The `pre-push` hook receives multiple ref ranges when we push multiple branches to the remote.

Below is pushing multiple branches plus setting up the upstream branch on the remote:

```
command: git push origin -u tj-test tj-test-2
stdin:
refs/heads/tj-test 62738e2 refs/heads/tj-test 0000000000000000000000000000000000000000
refs/heads/tj-test-2 feb40b5 refs/heads/tj-test-2 0000000000000000000000000000000000000000
```

Below is pushing multiple branches which already have upstream branches on the remote:

```
command: git push origin tj-test tj-test-2
stdin:
refs/heads/tj-test 61bc13ab2e717633441855d534904c5036f342dd refs/heads/tj-test 62738e2
refs/heads/tj-test-2 1e371d2c345e00c34410372dad2d7602aca99908 refs/heads/tj-test-2 feb40b5
```

With these multiple ref ranges, we should read all of them when computing `modified_files` and `modified_lines_in_file` in pre-push.
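The stdin lines above can be parsed into one object per pushed branch roughly like this, assuming the `<local ref> <local sha1> <remote ref> <remote sha1>` field order documented for git's pre-push hook (`PushedRef` here is a hypothetical struct, not necessarily the implementation's):

```ruby
# Each pre-push stdin line has four whitespace-separated fields:
# <local ref> <local sha1> <remote ref> <remote sha1>
PushedRef = Struct.new(:local_ref, :local_sha1, :remote_ref, :remote_sha1)

stdin = <<~INPUT
  refs/heads/tj-test 62738e2 refs/heads/tj-test 0000000000000000000000000000000000000000
  refs/heads/tj-test-2 feb40b5 refs/heads/tj-test-2 0000000000000000000000000000000000000000
INPUT

# One PushedRef per pushed branch; an all-zero remote sha1 indicates
# the branch does not yet exist on the remote.
pushed_refs = stdin.lines.map { |line| PushedRef.new(*line.split) }
```

Each `PushedRef` then carries everything needed to compute that branch's `remote_sha1..local_sha1` ref range.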
Force-pushed from 97fa56c to 5a5dc4a.
```diff
  def modified_files
-   @modified_files ||= Overcommit::GitRepo.modified_files(refs: ref_range)
+   @modified_files ||= pushed_refs.map(&:modified_files).flatten.uniq
```
```ruby
require 'benchmark'
require 'set'

iterations = 10
data = 100.times.map do
  Array.new(1_000) { rand(0..1_000) }
end

Benchmark.bmbm do |x|
  x.report('array.flatten.uniq') do
    iterations.times do
      data.flatten.uniq
    end
  end

  x.report('array.each_with_object{ merge }.to_a') do
    iterations.times do
      data.each_with_object(Set.new) do |datum, set|
        set.merge(datum)
      end.to_a
    end
  end
end

puts 'Are they equivalent?'
puts data.flatten.uniq.sort == data.each_with_object(Set.new) { |datum, set| set.merge(datum) }.sort
```

```
Rehearsal ------------------------------------------------------------------------
array.flatten.uniq                     0.100000   0.010000   0.110000 (  0.097410)
array.each_with_object{ merge }.to_a   0.140000   0.000000   0.140000 (  0.148073)
--------------------------------------------------------------- total: 0.250000sec

                                           user     system      total        real
array.flatten.uniq                     0.090000   0.000000   0.090000 (  0.100320)
array.each_with_object{ merge }.to_a   0.140000   0.000000   0.140000 (  0.151008)

Are they equivalent?
true
```
It seems that for this case (Arrays rather than Sets), calling flatten.uniq is actually faster.
```diff
  `git commit -m "Update Branch 2" 2>&1 > #{File::NULL}`
- should include(*%w[added-1 update-me added-2].map { |file| File.expand_path(file) })
+ should == %w[added-1 added-2 update-me].map { |file| File.expand_path(file) }
```
One thing worth noting is that using == here is slightly more brittle, as it prevents us from changing our implementation (which could return the file names in a different order) without breaking tests.
At the end of the day, our goal is to make sure we're returning the correct set of files, not files in a specified order, which is why we used include. However, perhaps you changed this because we weren't checking for unintended files to be included. In that case, you could use a combination of include along with verifying the count to make sure no additional files are in the set.
This is not a big deal, so I'm fine merging this as-is, but it's important that we're not overly specific in tests.
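The include-plus-count idea described above could look roughly like this in plain Ruby (illustrative file names, outside of RSpec):

```ruby
# Actual result and expected members; order may differ between them.
files    = %w[added-1 update-me added-2]
expected = %w[added-1 added-2 update-me]

# Same membership check as `should include(*expected)` ...
all_included = expected.all? { |f| files.include?(f) }

# ... plus a count check to rule out unintended extra files.
no_extras = files.size == expected.size

all_included && no_extras # => true
```

In RSpec specifically, the `match_array` matcher performs exactly this order-insensitive exact-membership check in one step.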
I’ll take a note for future reference. Thanks.
Thanks @taufek!