<br>
<h1 style = "text-align: center; color: #4d79ff;">Bug Fixing Identification</h1>
<h1 style = "font-size: 20px; text-align: center; color: #6d9cff;">Taha Shabani</h1>
<br>

<h2>Definition of project:</h2>
<p style = "font-size: 14px">This project builds a dataset that contains: (1) the bug id from the ITS, (2) the bug description from the ITS, (3) the bug-fixing change (BFC) hash from the SCM, and (4) the BFC message from the SCM.
<br>
<b>Note:</b>
<p style="text-indent :2em;">1. <mark>Issue Tracking System (ITS)</mark>: ActiveMQ -> JIRA (https://issues.apache.org/jira/)</p>
<p style="text-indent :2em;">2. <mark>Source Code Management system (SCM)</mark>: ActiveMQ -> Git (https://github.com/apache/ActiveMQ/)</p>
</p>
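<p>As a rough illustration of the four fields listed above, one row of the target dataset could be modeled as below. The class name <mark>BugFixRecord</mark>, its field names, and the placeholder values are assumptions made for this sketch; they are not part of the notebook's code.</p>

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BugFixRecord:
    """One row of the dataset (hypothetical model, for illustration only)."""
    bug_id: str           # (1) bug id from the ITS
    bug_description: str  # (2) bug description from the ITS
    commit_hash: str      # (3) bug-fixing change (BFC) hash from the SCM
    commit_msg: str       # (4) BFC message from the SCM


# Placeholder values only; real rows come from JIRA and Git.
record = BugFixRecord(
    bug_id="<bug id>",
    bug_description="<bug description>",
    commit_hash="<commit hash>",
    commit_msg="<commit message>",
)
print(list(asdict(record).keys()))
```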

<h2>Main Parts</h2>
<p style="text-indent :2em;"><a href="#Data-mining"> 1. Data Mining using <mark>pydriller</mark> and <mark>jira</mark> libraries.</a></p>
<p style="text-indent :2em;"><a href="#Extracting-commits-related-to-bugs"> 2. Extracting Commits related to bugs.</a></p>
<p style="text-indent :2em;"><a href="#Exporting-output"> 3. Exporting the output.</a></p>

<br>

In [1]:
from pydriller import Repository
import csv
import os.path
import pandas as pd
from jira import JIRA


<h3 style="color: #1a8cff;">Data mining</h3>
<h4> 1. SCM: </h4>
<p>
    To extract the SCM data, we use the pydriller library and fetch the <mark>hash</mark>, <mark>msg</mark>, and <mark>insertions</mark> of the commits that contain only insertions (i.e. deletions = 0 and insertions > 0).
</p>
<br>

In [2]:
repoPath = 'https://github.com/apache/ActiveMQ'
commitsFile = 'commits.csv'

# Mine the repository only once; later runs reuse the cached CSV file
if not os.path.isfile(commitsFile):
    # newline='' is what the csv module expects for files it writes
    with open(commitsFile, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['hash', 'msg', 'insertions'])
        writer.writeheader()

        for commit in Repository(repoPath).traverse_commits():
            # Keep commits that only insert new lines
            if commit.deletions == 0 and commit.insertions > 0:
                record = {'hash': commit.hash, 'msg': commit.msg, 'insertions': commit.insertions}
                writer.writerow(record)

<h4> 2. ITS: </h4>
<p>
    To extract the ITS issues, we use the jira library and keep only the issues with issuetype = Bug.
</p>
<p>
    Then we exclude unresolved bugs. A bug counts as resolved if it has <mark>status: Resolved</mark> or <mark>resolution: Fixed</mark>.
</p>
<br>
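<p>The two filters described above (issue type and resolved state) could also be expressed directly in JQL, so that JIRA returns only the relevant issues. The sketch below only builds the query string. The helper name <mark>build_resolved_bug_jql</mark> is hypothetical, and the project key <mark>AMQ</mark> is an assumption based on the issue keys shown in this notebook.</p>

```python
def build_resolved_bug_jql(project_key):
    """Build a JQL query for bugs that are Resolved or have resolution Fixed.

    Hypothetical helper for illustration; it is not part of the jira library.
    """
    return (
        f"project = {project_key} AND issuetype = Bug "
        "AND (status = Resolved OR resolution = Fixed) "
        "ORDER BY key ASC"
    )


# Assumption: "AMQ" is the project key, as suggested by keys such as AMQ-8408.
print(build_resolved_bug_jql("AMQ"))
```

<p>A string like this could be passed to <mark>search_issues</mark> in place of a query that only names the project. Whether that is preferable depends on how much of the unfiltered data should be kept locally.</p>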

In [3]:
jira = JIRA(server="https://issues.apache.org/jira")

def getAllIssues(jiraClient, projectName, fields):
    issues = []
    i = 0
    chunkSize = 100
    while True:
        chunk = jiraClient.search_issues(f'project = {projectName}', startAt=i, 
                                         maxResults=chunkSize, fields=fields)
        i += chunkSize
        issues += chunk.iterable
        if i >= chunk.total:
            break
    return issues

In [4]:
import warnings
warnings.filterwarnings('ignore')

jiraIssuesFile = 'jiraIssues.csv'

if not os.path.isfile(jiraIssuesFile):
    issues = getAllIssues(jira, 'ActiveMQ', ['id', 'resolution', 'status', 'issuetype', 'description'])

    # newline='' is what the csv module expects for files it writes
    with open(jiraIssuesFile, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['id', 'key', 'status', 'issuetype', 'resolution', 'description'])
        writer.writeheader()

        for issue in issues:
            fields = issue.raw['fields']
            # Keep only issues of type Bug
            if fields['issuetype']['name'] == 'Bug':
                resolution = fields['resolution']
                record = {'id': issue.id, 'key': issue.key,
                          'status': fields['status']['name'],
                          'issuetype': fields['issuetype']['name'],
                          # Unresolved issues have no resolution object
                          'resolution': "None" if resolution is None else resolution['name'],
                          'description': fields['description']
                         }
                writer.writerow(record)

In [5]:
commitDf = pd.read_csv(commitsFile, dtype={"hash": "string", "msg": "string", "insertions": "int"})
commitDf.head()

Unnamed: 0,hash,msg,insertions
0,40a7d3b6ac35d2ecb34e85fc3403d2e48e33874e,Moved the trunk code into the trunk sub direct...,194375
1,8f1763f078525b3cbfd30dd8389d3f61da56ac78,Moved the trunk code into the trunk sub direct...,899
2,262a5596d9300b7aded14d550cf8f5ee80d7ac0f,optimisation; if a JMS exception has already b...,6
3,1835e4536984abac76c6d4317a73f1a7e851e960,added some helper methods to make it easy to s...,28
4,e1cfbad4bcb9d3342da0b198798db120914a1976,added test case for using the BrokerService wi...,29


In [6]:
bugDf = pd.read_csv(jiraIssuesFile, dtype={"id": "string", "resolution": "string", "status": "string", 
                                           "issuetype": "string", "description": "string"})
bugDf = bugDf.loc[(bugDf['status'] == 'Resolved') | (bugDf['resolution'] == 'Fixed')]
bugDf.head()

Unnamed: 0,id,key,status,issuetype,resolution,description
4,13407920,AMQ-8408,Resolved,Bug,Fixed,
6,13404313,AMQ-8395,Resolved,Bug,Fixed,{noformat} AdvisoryBroker |...
7,13402290,AMQ-8389,Closed,Bug,Fixed,We’ve found the following broken URls on the A...
8,13401461,AMQ-8386,Resolved,Bug,Fixed,We’ve identified the following broken URLs on ...
17,13395929,AMQ-8357,Closed,Bug,Fixed,The download page [1] refers several times to ...



<h3 style="color: #1a8cff;">Extracting commits related to bugs</h3>

We iterate over the bugs and, for each one, find the commits whose message contains the bug key.
<p>1. <b>getKeyMutations()</b>: For each bug key we consider two variants, with and without the dash. For example, if <mark>key=AMQ-1111</mark>, we search for both <mark>AMQ-1111</mark> and <mark>AMQ1111</mark>.</p>
<p>2. <b>isMatchedWithKey()</b>: Suppose the key is <mark>AMQ-111</mark>. A plain substring search also matches longer keys such as <mark>AMQ-1110</mark> and <mark>AMQ-1111</mark>. To tell the correct matches from the incorrect ones, we look at the character that follows the key in the commit message (if there is one) and accept the match only if that character is not a digit.</p>
<p>3. We also exclude <mark>Merge pull request</mark> commits.</p>
<br>
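<p>The matching rules above can be tried on a few commit messages with the self-contained sketch below. The names <mark>key_variants</mark> and <mark>mentions_key</mark> and the sample messages are made up for illustration. The sketch uses a regular expression with a negative lookahead, which is one possible way to implement the "next character is not a digit" check.</p>

```python
import re


def key_variants(bug_key):
    """Return the key with and without its dash, e.g. AMQ-111 and AMQ111."""
    return [bug_key, bug_key.replace("-", "")]


def mentions_key(message, bug_key):
    """True if the message cites bug_key and the match is not followed by a digit."""
    for variant in key_variants(bug_key):
        # (?!\d) rejects AMQ-1110 when we are looking for AMQ-111;
        # re.search scans every position, so a later exact mention still counts.
        if re.search(re.escape(variant) + r"(?!\d)", message):
            return True
    return False


# Made-up commit messages, for illustration only.
messages = [
    "AMQ-111: fix in broker shutdown",         # exact key           -> True
    "AMQ-1110: unrelated change",              # longer key          -> False
    "AMQ111 fix",                              # key without dash    -> True
    "AMQ-1110 and AMQ-111 resolved together",  # exact key comes 2nd -> True
]
for message in messages:
    print(mentions_key(message, "AMQ-111"), message)
```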

In [7]:
import re

def getKeyMutations(bugKey):
    # Both spellings of the key: with and without the dash
    return [bugKey, bugKey.replace('-', '')]

def isMatchedWithKey(message, keys):
    # Check every occurrence of each key, not only the first one, and
    # reject matches that continue with a digit (AMQ-1110 for AMQ-111)
    for key in keys:
        if re.search(re.escape(key) + r'(?!\d)', message):
            return True
    return False
        
def findExactKeys(relevantCommits, keys):
    # Return only the commits whose message cites one of the keys exactly
    exactMatch = relevantCommits['msg'].apply(lambda m: isMatchedWithKey(m, keys))
    return relevantCommits.loc[exactMatch]

In [8]:
def getResultRows(bugId, bugKey, bugDesc, relCommits):
    result = []
    for index, commit in relCommits.iterrows():
        result.append({"bugId": bugId, "bugKey": bugKey, "bugDesc": bugDesc, "hash": commit['hash'], "msg": commit['msg']})
    return result

In [9]:
results = []
# Drop merge commits
excMergeCommits = commitDf.loc[~commitDf['msg'].str.contains("Merge pull request", regex=False, na=False)]

for index, bug in bugDf.iterrows():
    keyMutations = getKeyMutations(bug['key'])
    # Cheap substring pre-filter; the exact check happens in findExactKeys
    relevantCommits = excMergeCommits.loc[excMergeCommits['msg'].str.contains(keyMutations[0], regex=False, na=False) | excMergeCommits['msg'].str.contains(keyMutations[1], regex=False, na=False)]
    
    if not relevantCommits.empty:
        # Use the returned frame; otherwise the exact-key filter has no effect
        relevantCommits = findExactKeys(relevantCommits, keyMutations)
        results.extend(getResultRows(bug['id'], bug['key'], bug['description'], relevantCommits))


<h3 style="color: #1a8cff;">Exporting output</h3>
<br>

In [10]:
resultFile = "BFC.csv"
# newline='' is what the csv module expects for files it writes
with open(resultFile, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['bug id', 'bug key', 'bug description', 'commit hash', 'commit msg'])
    writer.writeheader()

    for row in results:
        record = {'bug id': row['bugId'], 'bug key': row['bugKey'], 'bug description': row['bugDesc'],
                  'commit hash': row['hash'], 'commit msg': row['msg']}
        writer.writerow(record)