Skip to content

sandervdo/metricminer2

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MetricMiner 2

MetricMiner is a Java framework that helps developers on mining software repositories. With it, you can easily extract information from any Git repository, such as commits, developers, modifications, diffs, and source codes, and quickly export CSV files.

Take a look at our documentation and our many examples. Or talk to us.

Documentation

Getting Started

You simply have to start a Java Project in Eclipse. MetricMiner is on Maven, so you can download all its dependencies by only adding this to your pom.xml. Or, if you want, you can see an example:

<dependency>
	<groupid>br.usp</groupid>
	<artifactid>metricminer</artifactid>
	<version>2.1.0</version>
</dependency> 

Always use the latest version in Maven. You can see them here: http://www.mvnrepository.com/artifact/br.usp/metricminer . You can also see a fully function pom.xml example.

MetricMiner needs a Study. The interface is quite simple: a single execute() method:

import br.com.metricminer2.MetricMiner2;
import br.com.metricminer2.Study;

public class MyStudy implements Study {

	public static void main(String[] args) {
		new MetricMiner2().start(new MyStudy());
	}

	@Override
	public void execute() {
		// do the magic here! ;)
	}
}

All the magic goes inside this method. In there, you will have to configure your study, projects to analyze, metrics to be executed, and output files. Take a look in the example. That's what we have to configure:

public void execute() {
	new RepositoryMining()
		.in(<LIST OF PROJECTS>)
		.through(<COMMITS>)
		.process(<PROCESSOR>, <OUTPUT>)
		.mine();
}

Let's start with something simple: we will print the name of the developers for each commit. For now, you should not care about all possible configurations. We will analyze all commits in the project at "/Users/mauricioaniche/workspace/metricminer2", outputing DevelopersVisitor to "/Users/mauricioaniche/Desktop/devs.csv".

  • in(): We use to configure the project (or projects) that will be analyzed.
  • through(): The list of commits to analyze. We want all of them.
  • fromTheBeginning(): Commits will be analysed from the first commit in history to the most recent. Default is the opposite.
  • process(): Visitors that will pass in each commit.
  • mine(): The magic starts!
public void execute() {
	new RepositoryMining()
		.in(GitRepository.singleProject("/Users/mauricioaniche/workspace/metricminer2"))
		.through(Commits.all())
		.process(new DevelopersVisitor(), new CSVFile("/Users/mauricioaniche/Desktop/devs.csv"))
		.mine();
}

In practice, MetricMiner will open the Git repository and will extract all information that is inside. Then, the framework will pass each commit to all processors. Let's write our first DevelopersProcessor. It is fairly simple. All we will do is to implement CommitVisitor. And, inside of process(), we print the commit hash and the name of the developer. MetricMiner gives us nice objects to play with all the data:

import br.com.metricminer2.domain.Commit;
import br.com.metricminer2.persistence.PersistenceMechanism;
import br.com.metricminer2.scm.CommitVisitor;
import br.com.metricminer2.scm.SCMRepository;

public class DevelopersVisitor implements CommitVisitor {

	@Override
	public void process(SCMRepository repo, Commit commit, PersistenceMechanism writer) {
		
		writer.write(
			commit.getHash(),
			commit.getCommitter().getName()
		);

	}

	@Override
	public String name() {
		return "developers";
	}

}

That's it, we are ready to go! If we execute it, we will have the CSV printed into "/Users/mauricioaniche/Desktop/devs.csv". Take a look.

I bet you never found a framework simple as this one!

Configuring the project

The first thing you configure in MetricMiner is the project you want to analyze. MetricMiner currently only suports Git repositories. The GitRepository class contains two factory methods to that:

  • singleProject(path): When you want to analyze a single repository.
  • allProjectsIn(path): When you want to analyze many repositories. In this case, you should pass a path to which all projects are sub-directories of it. Each directory will be considered as a project to MetricMiner.

Logging

MetricMiner uses log4j to print useful information about its execution. We recommend tou to have a log4.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">

<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">

    <appender name="main" class="org.apache.log4j.ConsoleAppender">
        <layout class="org.apache.log4j.PatternLayout">
            <param name="ConversionPattern" value="%d{HH:mm:ss} %5p %m%n"/>
        </layout>
    </appender>

    <category name="br.com.metricminer2">
        <priority value="INFO"/>
        <appender-ref ref="main"/>
    </category>

    <category name="/">
        <priority value="INFO"/>
        <appender-ref ref="main"/>
    </category>

</log4j:configuration>

Selecting the Commit Range

MetricMiner allows you to select the range of commits to be processed. The class Commits contains different methods to that:

  • all(): All commits. From the first to the last.
  • onlyInHead(): It only analyzes the most recent commit.
  • single(hash): It only analyzes a single commit with the provided hash.
  • monthly(months): It selects one commit per month, from the beginning to the end of the repo.
  • list(commits...): The list of commits to be processed.
  • range(start,end): The range of commits, starting at "start" hash, ending at "end" hash.
  • betweenDates(from,to): The range of commits, starting at "from" timestamp, ending at "to" timestamp.

One interesting thing about MetricMiner is that is avoids huge commits. When a commit contains too many files (> 50), it will be ignored.

Getting Modifications

You can get the list of modified files, as well as their diffs and current source code. To that, all you have to do is to get the list of _Modification_s that exists inside Commit. A Commit contains a hash, a committer (name and email), an author (name, and email) a message, the date, its parent hash, and the list of modification.

@Override
public void process(SCMRepository repo, Commit commit, PersistenceMechanism writer) {
	
	for(Modification m : commit.getModifications()) {
		writer.write(
				commit.getHash(),
				commit.getAuthor().getName(),
				commit.getCommitter().getName(),
				m.getFileName(),
				m.getType()
		);
		
	}
}

A Modification contains a type (ADD, COPY, RENAME, DELETE, MODIFY), a diff (with the exact format Git delivers) and the current source code. Remember that it is up to you to handle deleted or renamed files in your study.

Branches

MetricMiner puts all commits from all branches in a single sentence. It means that different commits from different branches will appear. It is your responsibility to filter the branches in your CommitVisitor.

The Commit class contains the getBranches() method. It returns the list of branches in which a commit belongs to. If you want to use only commits in the master branch, you can simply check whether 'master' in inside this set.

Note about the implementation: This is not supported by JGit, so it makes use of Git directly.

Blame

The SCM class contains a blame() method, which allows you to blame a file in a specific commit:

List<BlamedLine> blame(String file, String commitToBeBlamed, boolean priorCommit)

You should pass the file name (relative path), the commit which the file should be blamed, and a boolean informing whether you want the file to be blamed before (priorCommit=true) or after (priorCommit=false) the changes of that particular commit.

Managing State in the Visitor

If you need to, you can store state in your visitors. As an example, if you do not want to process a huge CSV, you can pre-process something before. As an example, if you want to count the total number of modified files per developer, you can either output all developers and the quantity of modifications, and then sum it later using your favorite database, or do the math in the visitor. If you decide to do it, it will be your responsibility to save the results afterwards.

import java.util.HashMap;
import java.util.Map;

import br.com.metricminer2.domain.Commit;
import br.com.metricminer2.persistence.PersistenceMechanism;
import br.com.metricminer2.scm.CommitVisitor;
import br.com.metricminer2.scm.SCMRepository;

public class ModificationsVisitor implements CommitVisitor {

	private Map<String, Integer> devs;
	
	public ModificationsVisitor() {
		this.devs = new HashMap<String, Integer>();
	}
	
	@Override
	public void process(SCMRepository repo, Commit commit, PersistenceMechanism writer) {
		
		String dev = commit.getCommitter().getName();
		if(!devs.containsKey(dev)) devs.put(dev, 0);
		
		int currentFiles = devs.get(dev);
		devs.put(dev, currentFiles + commit.getModifications().size());
		
	}

	@Override
	public String name() {
		return "files-per-dev";
	}

}

Parsing Code

You have entire source code of the repository. You may want to analyze it. MetricMiner comes with JDT and ANTLR bundled. JDT is the Eclipse internal parser, so you will not regret to use it.

Let's say we decide to count the quantity of methods in each modified file. All we have to do is to create a CommitVisitor, the way we are used to. This visitor will use our JDTRunner to invoke a JDT visitor (yes, JDT uses visitors as well). Notice that we executed the JDT visitor, and then wrote the result.

import java.io.ByteArrayInputStream;

import br.com.metricminer2.domain.Commit;
import br.com.metricminer2.domain.Modification;
import br.com.metricminer2.parser.jdt.JDTRunner;
import br.com.metricminer2.persistence.PersistenceMechanism;
import br.com.metricminer2.scm.CommitVisitor;
import br.com.metricminer2.scm.SCMRepository;


public class JavaParserVisitor implements CommitVisitor {

	@Override
	public void process(SCMRepository repo, Commit commit, PersistenceMechanism writer) {

		for(Modification m : commit.getModifications()) {
		
			if(m.wasDeleted()) continue;
			
			NumberOfMethodsVisitor visitor = new NumberOfMethodsVisitor();
			new JDTRunner().visit(visitor, new ByteArrayInputStream(m.getSourceCode().getBytes()));
			
			int methods = visitor.getQty();
			
			writer.write(
				commit.getHash(),
				m.getFileName(),
				methods
			);
			
		}
		
	}

	@Override
	public String name() {
		return "java-parser";
	}

}

The visitor is quite simple. It has methods for all different nodes in the file. All you have to do is to visit the right node. As an example, we will visit MethodDeclaration and count the number of times it is invoked (one per each method in this file).

import org.eclipse.jdt.core.dom.ASTVisitor;
import org.eclipse.jdt.core.dom.MethodDeclaration;

public class NumberOfMethodsVisitor extends ASTVisitor {

	private int qty = 0;
	
	public boolean visit(MethodDeclaration node) {
		
		qty++;
		return super.visit(node);
	}
	
	public int getQty() {
		return qty;
	}
}

If you want to see all methods available, check the documentation for ASTVisitor.

Getting the Current Revision

If you need more than just the metadata from the commit, you may also check out to the revision to access all files. This may be useful when you need to parse all files inside that revision.

To that, we will checkout() the revision, and get all files(). It returns the list of all files in the project at that moment. Then, it is up to you to do whatever you want. In here, we will use our NumberOfMethodsVisitor to count the number of files in all Java files. Please, remember to reset() as soon as you finish playing with the files.

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.List;

import org.apache.commons.io.IOUtils;

import com.mystudy.example4.NumberOfMethodsVisitor;

import br.com.metricminer2.domain.Commit;
import br.com.metricminer2.parser.jdt.JDTRunner;
import br.com.metricminer2.persistence.PersistenceMechanism;
import br.com.metricminer2.scm.CommitVisitor;
import br.com.metricminer2.scm.RepositoryFile;
import br.com.metricminer2.scm.SCMRepository;

public class JavaParserVisitor implements CommitVisitor {

	@Override
	public void process(SCMRepository repo, Commit commit, PersistenceMechanism writer) {

		try {
			repo.getScm().checkout(commit.getHash());
		
			List<RepositoryFile> files = repo.getScm().files();
			
			for(RepositoryFile file : files) {
				if(!file.fileNameEndsWith("java")) continue;
				
				File soFile = file.getFile();
				
				NumberOfMethodsVisitor visitor = new NumberOfMethodsVisitor();
				new JDTRunner().visit(visitor, new ByteArrayInputStream(readFile(soFile).getBytes()));
				
				int methods = visitor.getQty();
				
				writer.write(
						commit.getHash(),
						file.getFullName(),
						methods
				);
				
			}
			
		} finally {
			repo.getScm().reset();
		}
	}
	

	private String readFile(File f) {
		try {
			FileInputStream input = new FileInputStream(f);
			String text = IOUtils.toString(input);
			input.close();
			return text;
		} catch (Exception e) {
			throw new RuntimeException("error reading file " + f.getAbsolutePath(), e);
		}
	}
	

	@Override
	public String name() {
		return "java-parser";
	}

}

Dealing with Threads

How good is your machine? MetricMiner can execute the visitor over many threads. This is just another configuration you set in RepositoryMining. The withThreads() lets you configure the number of threads the framework will use to process everything.

We suggest you to use threads unless your project checkout revisions. The checkout operation in Git changes the disk, so you can't actually parallelize the work.

@Override
public void execute() {
	new RepositoryMining()
		.in(GitRepository.singleProject("/Users/mauricioaniche/workspace/metricminer2"))
		.through(Commits.all())
		.withThreads(3)
		.process(new JavaParserVisitor(), new CSVFile("/Users/mauricioaniche/Desktop/devs.csv"))
		.mine();
}

Using DataJoy

You have a bunch of CSV files. Do whatever you want with them. We usually upload it to a MySQL file or play it with R. If you haven't tried DataJoy yet, you should. It allows you to upload a CSV file and manipulate.

file <- read.csv("methods.csv", header=FALSE)

hist(file$V3)

You can play with the project here.

Creating your own CommitRange

(not written yet)

Creating your own PersistenceMechanism

(not written yet)

How do I cite MetricMiner?

Sokol, Francisco Zigmund, Mauricio Finavaro Aniche, and Marco Gerosa. "MetricMiner: Supporting researchers in mining software repositories." Source Code Analysis and Manipulation (SCAM), 2013 IEEE 13th International Working Conference on. IEEE, 2013.

@inproceedings{sokol2013metricminer,  
title={MetricMiner: Supporting researchers in mining software repositories},  
author={Sokol, Francisco Zigmund and Finavaro Aniche, Mauricio and Gerosa, Marco and others},  
booktitle={Source Code Analysis and Manipulation (SCAM), 2013 IEEE 13th International Working Conference on},  
pages={142--146}, year={2013},  
organization={IEEE}  
}  

How can I discuss about it?

You can subscribe to our mailing list: https://groups.google.com/forum/#!forum/metricminer.

How to Contribute

Required: Git, Maven.

git clone https://github.com/mauricioaniche/metricminer2.git
cd metricminer2/test-repos
unzip *.zip

Then, you can:

  • compile : mvn clean compile
  • test : mvn test
  • eclipse : mvn eclipse:eclipse
  • build : mvn clean compile assembly:single

License

This software is licensed under the Apache 2.0 License.

About

a tool to support researchers on mining software repositories studies

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Java 95.0%
  • ANTLR 3.1%
  • Shell 1.9%