# bugfixtime
Predicting how many days it takes to fix a bug, given the Jira information available when the bug is filed.

## Motivation
Bugs and software development are inseparable, and every bug in a software product means that at least some man-hours must be assigned to address it. According to [a study in 2002 by America's National Institute of Standards and Technology (NIST)](http://www.abeacha.com/NIST_press_release_bugs_cost.htm), software bugs cost the U.S. economy an estimated **$59.5 billion** annually, or about 0.6 percent of the gross domestic product. Because bug-free software is almost impossible to achieve, the next best alternative is to know how much work is needed to fix bugs so that resources can be managed efficiently.  
This project attempts to chip away at that problem by predicting how many days it would take to fix a bug, given its metadata when it is filed in Jira.

## Problem Statement
The motivation above frames the problem statement as follows:  
*Given the metadata of the bug filed on Jira, predict how many days it would take to close/fix it.*
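
As a minimal sketch of what the prediction target looks like, the number of days to fix can be computed as the difference between the time a bug is filed and the time it is resolved. The column names `created` and `resolved` and the two sample rows below are assumptions for illustration only, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical sample; the real dataset's column names and formats may differ.
sample = pd.DataFrame({
    "created": ["2015-01-01 09:00:00", "2015-03-10 12:00:00"],
    "resolved": ["2015-01-04 09:00:00", "2015-03-10 18:00:00"],
})

# Parse the timestamps and express the difference in (fractional) days.
created = pd.to_datetime(sample["created"])
resolved = pd.to_datetime(sample["resolved"])
sample["days_to_fix"] = (resolved - created).dt.total_seconds() / 86400

print(sample["days_to_fix"].tolist())  # [3.0, 0.25]
```

Using fractional days keeps bugs that are fixed within hours distinguishable from each other; rounding to whole days is an equally plausible choice.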

## Data
The data used in this study is obtained from the research article titled [From Reports to Bug-Fix Commits: A 10 Years Dataset of Bug-Fixing Activity from 55 Apache's Open Source Projects](https://dl.acm.org/doi/10.1145/3345629.3345639) by Vieira, Da Silva, Rocha, and Gomes in 2019. The data is housed [here](https://figshare.com/articles/Replication_Package_-_PROMISE_19/8852084).  
Contained within is a dataset composed of more than 70,000 bug-fix reports from 10 years of bug-fixing activity of 55 projects from the Apache Software Foundation along with their Jira data and status.

## Data Processing
Running `mining-script.py` as specified by the README file in the research article generates 3 CSV files for each of the 55 Apache Software projects. The data used here comes from the `<projectname>-full-bug-fix-dataset` file of each project.

### Importing required packages

In [1]:
import matplotlib.pyplot as plt
from nltk.stem import PorterStemmer
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor 
from sklearn import preprocessing
import time

### Reading the file with project names
The CSV files generated by `mining-script.py` are named in the format `<projectname>-jira-bug-fix-dataset`, so a list of project names is needed to build the file name for each project. This list can be found in the file `projects.csv`. However, the project names in `projects.csv` do not match the file names exactly, and some names have to be added manually.

In [5]:
# read projects.csv
projects = pd.read_csv("data/projects.csv", delimiter=";")

# format the pandas series into list
projectsnames = list(projects["Name"].sort_values())

# missing names
missingnames = ["mng",
                "mrm",
                "dirkrb",
                "dirmina",
                "fc",
                "flink",
                "oozie",
                "hadoop",
                "hbase",
                "hdfs",
                "mapreduce",
                "tap5",
                "ww",
                "yarn"]

# add missing project names
projectsnames = projectsnames + missingnames

The list above is used to create the CSV path for each project's data. Reading the CSV files from these paths then gives the complete dataset.

In [8]:
# create a list of CSV paths for the for loop below
csvpaths = []
stopwords = ["commons", "core", "mina"]
for name in projectsnames:
    name = name.lower()
    namesplit = name.split(" ")
    namesplit  = [word for word in namesplit if word not in stopwords]
    name = "".join(namesplit)
    path = "./bug-fix/" + name + "-jira-bug-fix-dataset.csv"
    csvpaths.append(path)

# read each CSV, collect the dataframes and count the number of files read
filesread = 0
frames = []
for path in csvpaths:
    try:
        somecsv = pd.read_csv(path, delimiter=";")
        filesread += 1
        frames.append(somecsv)
    except FileNotFoundError:
        continue

# DataFrame.append was removed in pandas 2.0; concatenate the collected frames instead
bugsdf = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

print("files read: %i (expect 56)" % filesread)

files read: 0 (expect 56)
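
A count of 0 means that none of the constructed paths could be read as an existing file. A quick check along the lines of the sketch below can list which paths are missing. The helper name `missing_paths` is hypothetical, and the temporary directory and placeholder file only stand in for the `./bug-fix/` folder.

```python
import tempfile
from pathlib import Path


def missing_paths(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Demonstration with a temporary directory standing in for ./bug-fix/.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "hbase-jira-bug-fix-dataset.csv"
    present.write_text("placeholder\n")  # dummy content, not real data
    absent = Path(tmp) / "nosuchproject-jira-bug-fix-dataset.csv"

    result = missing_paths([str(present), str(absent)])
    print([Path(p).name for p in result])
```

In the notebook, the equivalent call would be `missing_paths(csvpaths)`, which helps separate a wrong working directory (every path missing) from a handful of mismatched project names.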


## Data cleaning
This section examines key aspects of the data, such as column names and types, and reformats and cleans them as necessary.

### Column names and types
Looking at the column names and their data types gives a rough idea of what can be used as features for the prediction models. Details about what the columns mean can be found in the PDF accompanying the dataset.
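
A minimal sketch of this kind of inspection is shown below, using a small made-up dataframe in place of `bugsdf`. The column names and values are placeholders for illustration, not the dataset's real columns.

```python
import pandas as pd

# Made-up stand-in for bugsdf; the real columns are described in the dataset's PDF.
example = pd.DataFrame({
    "summary": ["NullPointerException on startup", "Typo in log message"],
    "priority": ["Major", "Trivial"],
    "comment_count": [4, 1],
})

# Column names paired with their pandas dtypes.
print(example.dtypes)

# Split columns by kind: numeric columns can be used directly, while
# text columns need encoding (for example TF-IDF) before modelling.
numeric_cols = example.select_dtypes(include="number").columns.tolist()
text_cols = example.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, text_cols)
```

Separating numeric from non-numeric columns early makes it easier to decide which columns need encoding before they can be fed to a regression model.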