Carlos Paradis edited this page Feb 10, 2023 · 78 revisions

1. Introduction

Kaiaulu's design rests mainly on two principles:

1.1 Familiar Interfaces

Kaiaulu is an R package. If you have used R before, then this architecture should be familiar to you. R packages are commonly defined as a set of functions (API), plus one or more R Notebooks explaining how to use them to conduct an analysis. To understand Kaiaulu, as with any other R package, one usually looks for R Notebooks that perform an analysis of interest and then reads the documentation of the functions they use. This means Kaiaulu is closer in design to PyDriller than to a tool such as Perceval, which provides only a CLI.

The choice of API first over CLI is meant to encourage exploration of intermediate steps. Kaiaulu is intended to be used as a toolbox for constructing different analyses on software development stacks (git log, mailing lists, issue trackers, source code, etc.). Combining data from these sources is often a messy process that requires heuristics and careful consideration of the open-source project being analyzed. Instead of creating an end-to-end pipeline "bleeding" a large number of intermediate files for verification based on what we think is important to you, we provide the working environment in the R Notebook so you may explore at your leisure.

To support this, and as is common in R, most functions return their data as one or more tables. Interested in parsing the git log? parse_gitlog() returns a table. Mailing list? parse_mbox() returns a table. This makes it easy to explore, export, and combine any intermediate step of the analysis.

This idea extends to the third-party software Kaiaulu interfaces with. Suppose you wish to understand how co-change relates to static code dependency in an open-source project. One way to do this would be to use the tools Perceval and Depends. However, this means you now need to learn each tool's interface and reshape the data so you can combine and compare the results side by side. Kaiaulu provides R functions that accept each tool's binary file path and do that for you instead, returning tables that you can compare.
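As a rough illustration of why table outputs make this comparison easy, the sketch below joins two toy tables with base R. The tables, column names, and values are invented stand-ins and are not the actual output of any Kaiaulu function.

```r
# Toy stand-ins for two intermediate tables. File names, column names, and
# values are invented for illustration only.
cochange <- data.frame(
  file_a = c("a.c", "a.c", "b.c"),
  file_b = c("b.c", "c.c", "c.c"),
  n_cochanges = c(5L, 1L, 3L)
)
dependency <- data.frame(
  file_a = c("a.c", "b.c"),
  file_b = c("b.c", "c.c"),
  n_dependencies = c(2L, 4L)
)

# Both tables are keyed by file pair, so a left join places them side by
# side. Pairs with no static dependency get NA in n_dependencies.
side_by_side <- merge(cochange, dependency,
                      by = c("file_a", "file_b"), all.x = TRUE)
print(side_by_side)
```

Any other table tool in R would work equally well here; the point is only that no tool-specific reshaping is needed once both results are tables.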

1.2 Project Configuration Files

When analyzing an open-source project, we often need to manually inspect the conventions used by its community. For example, do developers provide traceability from commits to issues by annotating their commit messages? If yes, what format do they adopt? What infrastructure does the project use? Etc.

While R Notebooks serve as a means to document different kinds of analysis, they do not "scale" well if we want to document the information above for the same analysis over multiple projects and ensure reproducibility. For example, should we store 60 notebooks with almost exactly the same code, but for different projects?

To address this issue, Kaiaulu takes inspiration from Codeface config files to decouple project information from R Notebooks. Codeface, however, is a CLI-only tool, so we approach project configuration files a bit differently in order to stay close to Familiar Interfaces.

First, the Kaiaulu API does not rely on project configuration files. That is, no function expects a project configuration file as input. This keeps the API consistent with what one expects of an R package and its data types. In an R Notebook, we use project configuration files, but only at the very beginning of the Notebook. In essence, we initialize variables by reading the project configuration file instead of hard-coding them in the Notebook. This means you can inspect each variable being loaded in the Notebook, or even hard-code the values used. Just as one looks to R Notebooks to learn the API, one can look to them to see how project configuration files are used. Finally, the CLI, which is still a work in progress and is not commonly present in R packages, requires project configuration files as input for analysis.
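The pattern of reading the configuration once at the top of a Notebook might look roughly like the sketch below. It assumes the `yaml` R package is installed. The file content, the key names (`version_control`, `log`), and the variable names are illustrative assumptions, not copied from a Kaiaulu template.

```r
# Assumption: the `yaml` package is installed. The config content, key
# names, and variable names below are hypothetical.
library(yaml)

# Stand-in for a project configuration file, written to a temporary path so
# the sketch runs on its own. In a Notebook this would be an existing file.
conf_path <- tempfile(fileext = ".yml")
writeLines(c(
  "# hypothetical project configuration",
  "version_control:",
  "  log: ../rawdata/git_repo/example/.git"
), conf_path)

# Top of the Notebook: read the configuration file once...
conf <- yaml::read_yaml(conf_path)

# ...and initialize plain R variables from it.
git_repo_path <- conf[["version_control"]][["log"]]

# From here on, functions receive ordinary values such as this path,
# never the configuration file itself.
print(git_repo_path)
```

Because `git_repo_path` is an ordinary character value, it can be inspected or replaced with a hard-coded string without touching the rest of the Notebook.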

In essence, project configuration files are simply a way to store, in a reproducible and organized manner, the minimal information we believe is needed to re-run an analysis. We use YAML for the configuration files, but only minimally, as bullet lists. This also allows you to annotate your file with comments as you see fit (# symbol) and to continually grow a repository of config files, saving re-analysis time.

1.3 Configuration File Setup

To set up a configuration file for a specific project, we can start from one of the templates provided in the Kaiaulu conf folder. Then, we must investigate the project in depth to see what tools it uses (JIRA, Bugzilla, etc.). A good way to do this is to visit the project website directly. Once we determine what tools the project uses, we can modify our configuration file accordingly, filling in the relevant fields.

Let's look at an example of setting up a configuration file for the Apache Thrift project in order to run download_jira_data.Rmd. First, we should check that Apache Thrift actually uses JIRA, which we can do by browsing the Thrift project website. On the Developers page, we find that this project does indeed use JIRA. Next, let's look at the vignette and our configuration file to see which fields we need to fill in. For this vignette, we must set the domain, project_key, issues, and issue_comments fields in our template configuration file. The default domain is already what we need, https://issues.apache.org/jira, as we are analyzing an Apache project. Next, let's set the project_key. To do so, let's take a look at the Thrift JIRA page, linked from the Developers page. Here, we see that the key for Apache Thrift is "THRIFT", which we can enter for the project_key field in our configuration file. Finally, a closer look at download_jira_data.Rmd tells us that the issues and issue_comments fields should be set to local paths to the files that will store the issues and issue comments downloaded by the notebook. You can follow this process to set up your own configuration files for specific projects when running analyses with Kaiaulu.
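Put together, the four fields from the Thrift walkthrough could look something like the fragment below. This is only a sketch: the `issue_tracker` and `jira` nesting and both file paths are assumptions made for illustration, so check the template in the conf folder for the actual layout.

```yaml
# Sketch of the JIRA fields for Apache Thrift.
# Assumption: the issue_tracker/jira nesting and both paths are illustrative.
issue_tracker:
  jira:
    # Default domain, already correct for Apache projects
    domain: https://issues.apache.org/jira
    # Key shown on the Thrift JIRA page
    project_key: THRIFT
    # Hypothetical local paths where download_jira_data.Rmd stores its output
    issues: ../rawdata/issue_tracker/thrift_issues.json
    issue_comments: ../rawdata/issue_tracker/thrift_issue_comments.json
```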
