# Examining the effects of ownership on software quality
## The Case Of Lucene

We want to replicate the [study](http://dl.acm.org/citation.cfm?doid=2025113.2025119 "Examining the effects of ownership on software quality") done by Bird et al. and published at FSE'11. The idea is to see the results of a similar investigation on an OSS system. We select [Lucene](https://lucene.apache.org/core/), a search engine written in Java.

## Data collection

First we need to gather the data for our **table**, a step known as *data collection*.

In our case, we are interested in the relation between some ownership-related metrics and post-release bugs. We investigate this relation at the *file level*, because we focus on Java, where the building blocks are classes, which most of the time correspond 1-to-1 to files.

This means that our table will have one row per source code file and one column per metric we compute for that file, plus one column with the number of post-release bugs.

### Collecting git data

To compute most of the metrics we want to investigate (e.g., how many people changed a file in its entire history) we need to know the history of each file. We can get it by analyzing the *versioning system*. Lucene has a Subversion repository, but a [git mirror](https://github.com/apache/lucene-solr.git) is also available. We use the git repository because it keeps the entire history locally, which makes the computations faster.

To run git commands from Python we use the library 'sh'.

In [1]:
import sh

We start by cloning the repository

In [2]:
sh.git.clone("https://github.com/apache/lucene-solr.git")



Then we bind our 'git' command to the cloned directory, so that every later call runs inside it.

In [3]:
git = sh.git.bake(_cwd='lucene-solr')
git.status()

On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean

To perform the replication, we could either reason in terms of releases (see [list of Lucene releases](http://archive.apache.org/dist/lucene/java/)), or we could just inspect the 'trunk' in the versioning system and start from a given date.

In this assignment, we go for the second option: we consider the 'trunk' (the main branch in svn) and focus on a 6-month period, looking at the bugs that affect the files existing at the start of that period. For the bug data, we consider a time window from Feb 01, 2015 to Jul 31, 2015.

Let's retrieve the list of files existing in the trunk on Feb 01, 2015.

In [119]:
shaFeb15 = str(git("rev-list", "-n", "1", "--before=2015-02-01 00:01", "master")).strip()
shaFeb15

'b5db48c783e9d0dc19c087101f03d8834b201106'

In [120]:
git.checkout(shaFeb15)
git.status()

HEAD detached at b5db48c
nothing to commit, working directory clean

With the right snapshot checked out, we collect all the Java files in the repository at that snapshot. We store them in a dict keyed by the full path, whose value holds the file name, the package name and the slots for the metrics.

TODO: check how to build a proper package name, i.e. how to extract org.apache.lucene from lucene/test-framework/src/java/org/apache/lucene/search/ShardSearchingTestBase.java
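One possible heuristic for the package name, as a sketch only: it assumes the package directories start right after a source-root marker such as `src/java/` or `src/test/`. The marker list and the function name `package_from_path` are assumptions for illustration, based on the paths seen in this notebook. Reading the `package` declaration inside each file would be more robust.

```python
# Hypothetical source-root markers; an assumption based on the paths seen in
# this notebook, not an exhaustive description of the repository layout.
SOURCE_ROOTS = ("/src/java/", "/src/test/")


def package_from_path(path):
    """Guess the Java package of a file from its repository path.

    Returns None when no known source-root marker is found.
    """
    for marker in SOURCE_ROOTS:
        head, sep, tail = path.partition(marker)
        if sep:
            # tail looks like "org/apache/lucene/search/Foo.java"
            parts = tail.split("/")[:-1]  # drop the file name
            return ".".join(parts)
    return None


example = "lucene/test-framework/src/java/org/apache/lucene/search/ShardSearchingTestBase.java"
print(package_from_path(example))
```

For the path above this yields `org.apache.lucene.search`, i.e. the full package of the class rather than only the `org.apache.lucene` prefix.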

In [129]:
listfiles = git("ls-files").split("\n")

# java_files maps each full path to a list:
# [filename, package, minor, major, total, ownership, num_of_bugs]
java_files = {}

for path in listfiles:
    if path.endswith(".java"):
        split_file = path.split("/")
        pkg = '/'.join(split_file[:-1])  # directory for now; proper package name still to do
        filename = split_file[-1]
        java_files[path] = [filename, pkg, 0, 0, 0, 0, 0]
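With a positional list it is easy to update the wrong slot later on. As an alternative sketch, the same row can be given named fields. The class name `FileMetrics` and the helper `build_table` are hypothetical and not part of the original notebook; the field names mirror the list layout described above.

```python
from dataclasses import dataclass


@dataclass
class FileMetrics:
    """One row of the table; field names mirror the list layout used above."""
    filename: str
    package: str
    minor: int = 0
    major: int = 0
    total: int = 0
    ownership: float = 0.0
    num_of_bugs: int = 0


def build_table(paths):
    """Create one FileMetrics entry per .java path, keyed by the full path."""
    table = {}
    for path in paths:
        if path.endswith(".java"):
            directory, _, filename = path.rpartition("/")
            table[path] = FileMetrics(filename=filename, package=directory)
    return table


path = "lucene/suggest/src/java/org/apache/lucene/search/suggest/BitsProducer.java"
table = build_table([path, "README.txt"])
table[path].num_of_bugs += 1  # the field is addressed by name, not by index
```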

Next we need the last commit before Aug 01, 2015.

In [130]:
shaAug15 = str(git("rev-list", "-n", "1", "--before=2015-08-01 00:01", "master")).strip()
shaAug15

'7b412fdc630081ef8299952e1ea583eee5e89197'

In [131]:
git.checkout(shaAug15)
git.status()

HEAD detached at 7b412fd
nothing to commit, working directory clean

Now we need the bugs created between 2015-02-01 00:02 and 2015-08-01 00:00. We get them by querying the Apache JIRA for the LUCENE and SOLR projects.

In [132]:
from jira import JIRA
jira = JIRA("https://issues.apache.org/jira/")
    
checkpoint = 100
total_issues = 0

issues = []

while checkpoint == 100:
    issues_page = jira.search_issues("(project=LUCENE or project=SOLR) and (created >= '2015-02-01 00:02' and created <='2015-08-01 00:00') and status = 'Closed'", startAt=total_issues, maxResults = 100)
    for i in issues_page:
        issues.append(i.key)
    checkpoint = len(issues_page)
    total_issues += checkpoint
    print(total_issues)

# at this point we have the keys of all the issues matched by the query
# (note that the query does not filter on issue type)
# next we look for these keys in the git log to find which files each issue affected
# let's make a function i.e. find_file(bug_key) 

100
200
300
400
500
600
700
733
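The loop above stops as soon as a page comes back shorter than the page size. The same pattern can be sketched without network access: here `fetch_page` is a hypothetical callable standing in for a call such as `jira.search_issues(..., startAt=..., maxResults=...)`, and the issue keys are fake data used only to exercise the loop.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect all results from a paginated source.

    fetch_page(start_at, max_results) is a hypothetical callable that returns
    one page of results as a list.
    """
    results = []
    while True:
        page = fetch_page(len(results), page_size)
        results.extend(page)
        if len(page) < page_size:  # a short page means we reached the end
            return results


# Fake data source with 733 keys, used only to exercise the loop.
fake_keys = ["LUCENE-%d" % n for n in range(733)]


def fake_fetch(start_at, max_results):
    return fake_keys[start_at:start_at + max_results]


all_keys = fetch_all(fake_fetch)
print(len(all_keys))
```

When the total is an exact multiple of the page size, this pattern performs one extra request that returns an empty page before it stops.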


In [133]:
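import re

gitlog = (git("log","--after=\"2015-02-01 00:00\"","--before=\"2015-08-01 00:00\"","--format=%H#%s")).split("\n")
issue_keys = set(issues)  # set membership is faster than scanning the list
commit_to_bug = []
for line in gitlog:
    if "#" not in line:
        continue  # skip the empty line left by the trailing newline
    # split only on the first '#', since the subject itself may contain '#'
    commit_id, subject = line.split("#", 1)
    bug_key = re.findall(r"\b(?:SOLR|LUCENE)-[0-9]+", subject)
    # keep only the first key mentioned in the subject
    if bug_key and bug_key[0] in issue_keys:
        commit_to_bug.append([commit_id, bug_key[0]])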
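The cell above keeps only the first issue key found in each commit subject. The sketch below isolates the parsing step and returns every key mentioned in the subject. The function name `parse_log_line` is hypothetical, and the sample log line is made up for illustration; it is not real repository data.

```python
import re

# Matches issue keys such as LUCENE-1234 or SOLR-567.
ISSUE_KEY = re.compile(r"\b(?:LUCENE|SOLR)-[0-9]+\b")


def parse_log_line(line):
    """Split a '%H#%s' log line into (commit_id, [issue keys in the subject])."""
    # partition splits on the first '#' only, so a '#' in the subject is kept
    commit_id, _, subject = line.partition("#")
    return commit_id, ISSUE_KEY.findall(subject)


# Made-up log line; the hash and the issue numbers are not real data.
line = "0123abcd#LUCENE-1111: fix merge bug #2, see also SOLR-2222"
print(parse_log_line(line))
```

Whether a commit that mentions several keys should count for all of them or only for the first is a design choice that affects the bug counts.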

At this point we have a mapping between commit ids and bug keys within the observed interval. The next step is to find the files changed by each commit and then compute the number of bugs for each file.

In [134]:
# git diff-tree --name-only -r commit_id
# for each bug-related commit, find the changed files

import copy

# deep copy, so that the counts do not modify java_files itself
java_files_copy = copy.deepcopy(java_files)

for i in commit_to_bug:
    commit_id_bug = i[0]
    changed_files = git("diff-tree","-r","--no-commit-id","--name-only",commit_id_bug)
    x = changed_files.split("\n")
    for y in x:
        if y in java_files:
            java_files_copy[y][6] += 1  # index 6 is num_of_bugs
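The loop above adds one per commit, so an issue fixed across several commits is counted several times for the same file. If the intended metric is the number of distinct bugs per file (an assumption, since the notebook does not say), it could be sketched as follows. `changed_files_of` is a hypothetical callable standing in for the `git diff-tree` call, and the data is a toy example.

```python
from collections import defaultdict


def bugs_per_file(commit_to_bug, changed_files_of, tracked_files):
    """Count distinct issue keys per tracked file.

    commit_to_bug    -- iterable of (commit_id, issue_key) pairs
    changed_files_of -- hypothetical callable: commit_id -> list of paths
    tracked_files    -- collection of paths that belong to the table
    """
    issues_by_file = defaultdict(set)
    for commit_id, issue_key in commit_to_bug:
        for path in changed_files_of(commit_id):
            if path in tracked_files:
                issues_by_file[path].add(issue_key)
    return {path: len(keys) for path, keys in issues_by_file.items()}


# Toy data for illustration only.
changes = {"c1": ["A.java", "B.java"], "c2": ["A.java"], "c3": ["A.java", "notes.txt"]}
pairs = [("c1", "LUCENE-1"), ("c2", "LUCENE-1"), ("c3", "SOLR-2")]
counts = bugs_per_file(pairs, changes.get, {"A.java", "B.java"})
print(counts)
```

In the toy data, `A.java` is touched by three commits but only two distinct issues, so it gets a count of 2.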

In [135]:
# print(java_files_copy)

for key in java_files_copy:
    print(key + "|" + java_files_copy[key][0] + "|" + java_files_copy[key][1] + "|" + str(java_files_copy[key][2]))

lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/config/FieldBoostMapFCListener.java|FieldBoostMapFCListener.java|lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/config|0
lucene/suggest/src/java/org/apache/lucene/search/suggest/BitsProducer.java|BitsProducer.java|lucene/suggest/src/java/org/apache/lucene/search/suggest|0
lucene/queries/src/test/org/apache/lucene/queries/TermsQueryTest.java|TermsQueryTest.java|lucene/queries/src/test/org/apache/lucene/queries|0
lucene/analysis/common/src/java/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.java|EdgeNGramTokenizer.java|lucene/analysis/common/src/java/org/apache/lucene/analysis/ngram|0
solr/core/src/java/org/apache/solr/handler/component/RangeFacetRequest.java|RangeFacetRequest.java|solr/core/src/java/org/apache/solr/handler/component|0
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/nodes/WildcardQueryNode.java|WildcardQueryNode.java|lucene/queryparser