westwood846/node-idman


idman

Map inconsistent git developer metadata to real identities

const idman = require('node-idman');
const repoStats = idman('/path/to/repo/');
// => see sample-output.json

The above performs the default identity merge for a repo. An alternative merge algorithm may be specified with the optional second argument. For a non-default algorithm, additional arguments may optionally be provided as varargs. The call returns the author and committer of every commit in that repository according to the merged identities, along with the identities themselves (example output).

This is actually just a fork of the original idman with some wrapper code added and some unimportant files removed. When the original idman changes, this repo should remain compatible and able to pull the changes (except perhaps for this readme).

Requirements

  • git (obviously)
  • perl 5.16 or higher (git depends on perl, so you probably have it already)
  • node (obviously)
  • some algorithms may require additional things

Output

See sample-output.json for an example. It's the output for this here repository.

The idman output will be a JSON object. It contains the following keys:

identities

An array of identities, each representing an individual contributor. Each identity is a list of [name, e-mail address] tuples.

commits

An object representing all commits in the repository, keyed by their hash. Each individual commit contains the following keys:

  • author, committer: These values are integers referring to indexes in the identities array, or null if no such association exists. Use these to tell who authored, committed or signed this particular commit.
  • author_name, author_mail, committer_name, committer_mail: These are the raw names and e-mails from git. Don't use these for identification, they are raw and the identities aren't merged! Use author and committer instead.
  • repo: The path to the repository's local folder. If you want to run further git commands on it, you might need to append /.git to it.
  • hash: The commit's sha-1 hash.
  • author_date: The date that the commit was authored as a Unix timestamp. Note that this is a string of digits, not an integer.
  • committer_date: The date that the commit was committed.
  • subject: The commit message subject line.
  • body: The rest of the commit message.
  • notes: The notes attached to the commit. Basically a message in addition to the regular commit message.
  • signer: The name or e-mail or whatever else the person who signed the commit put here.
  • signer_key: The signature key of who signed the commit.
  • touched_files, insertions, deletions: The number of modified files, inserted lines and deleted lines in the commit, respectively. Renamed files are taken into account properly, so a rename on its own counts as a single changed file with zero inserted or deleted lines.

See the --find-renames and --find-copies options in git log --help for details.
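As an illustration of how the output described above might be consumed, the following sketch counts authored commits per merged identity and converts author_date into a Date. The data is hand-written and hypothetical (it is not real idman output), and the helper names commitsPerAuthor, authoredAt and label are made up for this example.

```javascript
// Hypothetical, hand-written data in the documented output shape.
const output = {
  identities: [
    [['Alice Example', 'alice@example.com'], ['alice', 'alice@users.example.org']],
    [['Bob Example', 'bob@example.com']],
  ],
  commits: {
    aaa111: { hash: 'aaa111', author: 0, committer: 0, author_date: '1500000000' },
    bbb222: { hash: 'bbb222', author: 1, committer: 0, author_date: '1500003600' },
    ccc333: { hash: 'ccc333', author: null, committer: 1, author_date: '1500007200' },
  },
};

// Count authored commits per identity index; commits whose author is null
// (no association) are skipped.
function commitsPerAuthor(result) {
  const counts = new Map();
  for (const commit of Object.values(result.commits)) {
    if (commit.author === null) continue;
    counts.set(commit.author, (counts.get(commit.author) || 0) + 1);
  }
  return counts;
}

// author_date is a string of digits (Unix seconds), so convert it to a
// number and to milliseconds before building a Date.
function authoredAt(commit) {
  return new Date(Number(commit.author_date) * 1000);
}

// Use the first [name, e-mail] tuple of an identity as a display label.
function label(result, index) {
  const [name, mail] = result.identities[index][0];
  return `${name} <${mail}>`;
}

for (const [index, count] of commitsPerAuthor(output)) {
  console.log(label(output, index), count);
}
```

The raw author_name and author_mail fields are deliberately not used here, since only the author and committer indexes reflect the merged identities.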

Structure

idman is the controller that executes all the pieces in lib and pipes them together properly.

parseman gathers all commit information from git and spews out a JSON object for each of them on stdout.

graphman does the identity merging from the commit information it receives from parseman. It can pick from various identity merging algorithms.

assocman receives the results from graphman and the raw commit information from parseman and associates the two, producing the final output. If the algorithm is bad and results in ambiguous associations, assocman will die.

Most of that code has embedded documentation at the bottom of the respective files. You can see it nicely formatted by running perldoc FILE.

Algorithms

The default algorithm: like occurrence, but ignores case and strips off the .(none) suffix at the end of e-mail addresses, which git seems to attach and remove at random if the e-mail doesn't contain a dot.

As the name implies, it is used when no other algorithm is specified.
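The normalization described here can be sketched in a few lines. This is only an illustration in JavaScript, not the Perl code that graphman actually runs; the function name normalizeArtifact is made up, and applying it to names as well as e-mail addresses is an assumption.

```javascript
// Assumed normalization: lower-case the artifact, then drop a trailing
// ".(none)" if one is present.
function normalizeArtifact(artifact) {
  return artifact.toLowerCase().replace(/\.\(none\)$/, '');
}

console.log(normalizeArtifact('Alice@Laptop.(none)'));
console.log(normalizeArtifact('alice@laptop'));
```

Both calls yield the same string, which is what allows the two spellings to be treated as one artifact in a subsequent occurrence-style merge.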

The occurrence algorithm merges identities if they contain identical artifacts (names or e-mail addresses).
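One way to picture merging on identical artifacts is a union-find over [name, e-mail] pairs. The sketch below is a JavaScript illustration under that assumption and is not taken from graphman; mergeByOccurrence is a hypothetical name, and keeping names and e-mail addresses in separate namespaces is a choice made for this example.

```javascript
// Group [name, mail] pairs so that pairs sharing a name or an e-mail address
// (directly or through a chain of other pairs) end up in the same identity.
function mergeByOccurrence(pairs) {
  const parent = pairs.map((_, i) => i);

  // Find the representative of a set, shortening paths along the way.
  const find = (i) => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]];
      i = parent[i];
    }
    return i;
  };
  const union = (a, b) => {
    parent[find(a)] = find(b);
  };

  // Remember the first pair in which each artifact was seen (assumption:
  // a name never matches an e-mail address, hence the key prefixes).
  const firstSeen = new Map();
  pairs.forEach(([name, mail], i) => {
    for (const key of [`name:${name}`, `mail:${mail}`]) {
      if (firstSeen.has(key)) union(i, firstSeen.get(key));
      else firstSeen.set(key, i);
    }
  });

  // Collect the pairs of each set into one identity.
  const groups = new Map();
  pairs.forEach((pair, i) => {
    const root = find(i);
    if (!groups.has(root)) groups.set(root, []);
    groups.get(root).push(pair);
  });
  return [...groups.values()];
}

// Hypothetical input: the first three pairs are linked through a shared name
// and a shared e-mail address, the fourth stands alone.
const identities = mergeByOccurrence([
  ['Alice', 'alice@a.example'],
  ['Alice', 'alice@b.example'],
  ['A. Lice', 'alice@b.example'],
  ['Bob', 'bob@c.example'],
]);
console.log(JSON.stringify(identities));
```

The result has the same shape as the identities array in the output: a list of identities, each a list of [name, e-mail] tuples.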

The similarity algorithm works like occurrence, but merges identities if their normalized Levenshtein distance is less than a predefined threshold. You must specify a threshold, where 0 < threshold <= 1. You can either do this by passing --threshold NUMBER as a command-line argument or by defining the GRAPHMAN_THRESHOLD environment variable.

This algorithm requires the Text::Levenshtein::XS Perl module. Install it via sudo cpan Text::Levenshtein::XS.
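To make the threshold comparison concrete, here is a small JavaScript sketch of a normalized Levenshtein distance. It does not use Text::Levenshtein::XS and is not graphman's code. Dividing the edit distance by the length of the longer string is an assumed normalization, and which artifacts the real algorithm compares is not stated here; the function names are hypothetical.

```javascript
// Classic dynamic-programming Levenshtein distance, keeping only two rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: divide by the longer length, giving a value in [0, 1].
function normalizedDistance(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : levenshtein(a, b) / longest;
}

// Two artifacts count as similar if their normalized distance is below the
// threshold, which must satisfy 0 < threshold <= 1.
function isSimilar(a, b, threshold) {
  if (!(threshold > 0 && threshold <= 1)) {
    throw new RangeError('threshold must satisfy 0 < threshold <= 1');
  }
  return normalizedDistance(a, b) < threshold;
}

console.log(isSimilar('jon smith', 'john smith', 0.2));
console.log(isSimilar('alice', 'bob', 0.5));
```

With this normalization, a smaller threshold makes merging stricter, and a threshold of 1 merges everything except pairs that differ in every position.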

An extension of the similarity algorithm above; the same requirements apply. It implements the algorithm used by Bird et al. in the paper “Mining Email Social Networks”. This does a whole bunch of pre-processing on the identities and pays attention to the difference between usernames and real first and last names.

Papers