/
parseman
executable file
·231 lines (154 loc) · 5.3 KB
/
parseman
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
#!/usr/bin/env perl
use 5.016;
use warnings;
use autodie;
use charnames ':full';
use Encode qw(decode);
use JSON::PP;
use Pod::Usage;
if (!@ARGV) {
pod2usage('Missing arguments.');
}
elsif (grep { /^--?h(elp)?$/ } @ARGV) {
pod2usage(-verbose => 2);
}
my %COMMIT_ARGS = (
hash => '%H',
author_name => '%aN',
author_mail => '%aE',
author_date => '%at',
committer_name => '%cN',
committer_mail => '%cE',
committer_date => '%ct',
subject => '%s',
body => '%b',
notes => '%N',
signer => '%GS',
signer_key => '%GK',
);
my @COMMIT_KEYS = sort keys %COMMIT_ARGS;
my @FORMAT_ARGS = @COMMIT_ARGS{@COMMIT_KEYS};
my $COMMIT_SEPARATOR = "\N{RECORD SEPARATOR}";
my $DIFF_SEPARATOR = "\N{UNIT SEPARATOR}";
# See the “Internals” in the documentation (run `perldoc parseman`)
my $FORMAT = $COMMIT_SEPARATOR . join('%x00', @FORMAT_ARGS) . $DIFF_SEPARATOR;
sub parse_log {
my ($log) = @_;
my @fields = split "\0", $log, -1;
return map { $COMMIT_KEYS[$_] => $fields[$_] } 0 .. $#fields;
}
sub parse_diff {
my ($diff) = @_;
my %changes = (
touched_files => 0,
insertions => 0,
deletions => 0,
);
for my $diff_line (split "\n", $diff) {
next unless $diff_line =~ /^(\d+)\s+(\d+)\s+(.+?)\s*$/;
$changes{insertions} += $1;
$changes{deletions } += $2;
$changes{touched_files}++;
}
return %changes;
}
my @git_args;
if ($ENV{IDMAN_BEFORE}) {
push @git_args, "--before=$ENV{IDMAN_BEFORE}";
}
for my $repo_path (@ARGV) {
open my $fh, '-|', 'git', "--git-dir=$repo_path/.git", 'log', '--all',
"--format=$FORMAT", '-M', '-C', '--numstat', @git_args;
read $fh, $_, length $COMMIT_SEPARATOR; # strip leading separator
local $/ = "\n$COMMIT_SEPARATOR";
while (<$fh>) {
chomp;
my ($log, $diff) = split $DIFF_SEPARATOR, decode('UTF-8', $_);
say encode_json({
repo => $repo_path,
parse_log($log),
parse_diff($diff),
});
}
close $fh;
}
__END__
=head1 NAME
parseman - parse commit info from a git directory
=head1 SYNOPSIS
parseman GIT_REPO_FOLDER...
This will parse all commits from the given git repo(s) and dump the results as
JSON objects on stdout, one per line.
=head1 DESCRIPTION
This script will take a git repository and run C<git log --all> on it,
transforming its output into structured JSON. This output can be piped into
another program.
The output consists of several JSON-encoded objects, one object per line. The
consuming program can just read one line at a time, JSON-decode each one and
then handle its contents.
For example output, just run C<./parseman .> to see the output for this here
repository.
Each JSON object is just a mapping from string keys to string values. Empty
values will be empty strings. The following values are in the object:
=over
=item repo
The path to the repository's local folder. If you want to run further git
commands on it, you might need to append C</.git> to it.
=item hash
The commit's sha-1 hash.
=item author_name
The name of the commit's author.
=item author_mail
The e-mail address of the author. Might not actually be a valid e-mail address,
git doesn't check this.
=item author_date
The date that the commit was authored as a Unix timestamp. Note that this is a
string of digits, not an integer.
=item committer_name
The name of the committer, which may or may not be different from the author.
=item committer_mail
The e-mail address of the committer. Same caveat as for the author applies.
=item committer_date
The date that the commit was committed.
=item subject
The commit message subject line.
=item body
The rest of the commit message.
=item notes
The notes attached to the commit. Basically a message in addition to the
regular commit message.
=item signer
Who signed the commit, if anybody did.
=item signer_key
The signature key of who signed the commit.
=item touched_files
=item insertions
=item deletions
The amount of modified files, inserted lines and deleted lines in the commit,
respectively. Renamed files are taken into account properly, so a rename on
its own counts as a single changed file with zero inserted or deleted lines.
See the C<--find-renames> and C<--find-copies> options in C<git log --help> for
details.
=back
=head1 INTERNALS
Calling C<git log> on every commit would be the easiest implementation for this
script, but that is just prohibitively slow. So instead, we run it only one
single time with a specially formatted output so that we can separate each
commit properly.
The format works as follows:
=over
=item
Each commit is prefixed with a C<RECORD SEPARATOR> character. This separates
the commits from each other.
=item
Each of the fields within a commit (hash, message, author name etc.) is
separated with a C<NUL> character.
=item
After those fields follows a C<UNIT SEPARATOR>, after which git automatically
puts the diff information (changed files and lines).
=back
Parsing the format works by splitting the input by C<RECORD SEPARATOR>s and
splitting each of those records by C<UNIT SEPARATOR>s into a fields portion and
a diff portion. The fields are then simply split by C<NUL>s and the diff is
parsed via regular expression.
=cut