Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Added a README with further instructions regarding the implementation. Added the MIT license.
- Loading branch information
0 parents
commit 8b94394
Showing
3 changed files
with
246 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
plan/*.aux | ||
plan/*.log | ||
plan/*.synctex.gz | ||
repos/* | ||
code-maat/* | ||
**.log.* | ||
**.log | ||
**/.ipynb_checkpoints | ||
**.retry | ||
**datasets | ||
*.json | ||
|
||
__pycache__ | ||
|
||
szz/bin/* | ||
szz/build/* | ||
szz/gradle | ||
szz/jars | ||
szz/gradlew | ||
szz/run | ||
szz/gerrit_all_commits_without_patchsets.json | ||
szz/gradlew.bat | ||
szz/bugReports.txt | ||
szz/historyCommits.txt | ||
szz/information | ||
szz/.gradle | ||
szz/results | ||
szz/libs | ||
szz/gradle.properties | ||
.gradle |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) 2018 Axis Communications AB | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,195 @@ | ||
# SZZUnleashed | ||
|
||
A implementation of the SZZ algorithm as described by Zeller et al's in ["When Do | ||
Changes Induce Fixes?"](https://www.st.cs.uni-saarland.de/papers/msr2005/). It | ||
also contains further improvements as described by Williams et al's in [Szz | ||
revisited: verifying when changes induce fixes](https://www.researchgate.net/publication/220854597_SZZ_revisited_verifying_when_changes_induce_fixes). | ||
|
||
## What is the usage of this algorithm? | ||
|
||
The SZZ algorithm is used to find bug introducing commits from a set of bug | ||
fixing commits. The bug introducing commits can be extracted either from a bug | ||
tracking system such as JIRA or simply by searching for commits that states that | ||
they are fixing something. The found bug introducing commits can then be used to | ||
build datasets for machine learning purposes such as when buggy commits wants to | ||
be found. | ||
|
||
## Prerequisites: | ||
|
||
* Java 8 | ||
* Gradle | ||
|
||
## Usage SZZ algorithm | ||
|
||
### Grab issues ### | ||
The fetch script is an example of how one can extract issues from a bug tracking | ||
system. | ||
```python | ||
python fetch.py | ||
``` | ||
It creates a directory with issues. To convert these into a format where they can | ||
be processed, use: | ||
```python | ||
python git_log_to_array.py <path_to_local_repo> | ||
``` | ||
This creates a file `gitlog.json` that is used to link the issues to bug fixing | ||
commits. Using the `find_bug_fixes.py` and this file, we can get a json file | ||
that contains the Issue and its corresponding commit SHA-1, the commit date, | ||
the creation date and the resolution date. | ||
|
||
### Find the bug introducing commits ### | ||
|
||
This implementation works regardless which language and file type. It uses | ||
[JGIT](https://www.eclipse.org/jgit/) to parse a git repository. | ||
|
||
To build a runnable jar file, use the gradle build script in the szz directory | ||
like: | ||
|
||
```shell | ||
gradle build && gradle fatJar | ||
``` | ||
|
||
Or if the algorithm should be runned without building a jar: | ||
|
||
```shell | ||
gradle build && gradle runJar | ||
``` | ||
|
||
The algorithm tries to use as many cores as possible during runtime. The more | ||
the merrier so to say. | ||
|
||
To get the bug introducing commits from a repository using the file produced | ||
by the previous issue to bug fix commit step, run: | ||
|
||
```shell | ||
java -jar szz_find_bug_introducers-<version_number>.jar -i <path_to_issues> -r <path_to_local_repo> | ||
``` | ||
To assemble the results if the algorithm was able to use more than one core, | ||
run the `assembler.py` script on the results directory. | ||
|
||
## Output | ||
|
||
The output can then be seen in three different files commits.json, | ||
annotations.json and fix\_and\_bug\_introducing\_pairs.json. | ||
|
||
The commits.json file includes all commits that have been blamed to be bug | ||
introducing but which haven't been analyzed by any anything. | ||
|
||
The annotations.json is a representation of the graph that is generated by the | ||
algorithm in the blaming phase. Each bug fixing commit is linked to all possible | ||
commits which could be responsible for the bug. Using the improvement from | ||
Williams et al's, the graph also contains subgraphs which gives a deeper search | ||
for responsible commits. It enables the algorithm to blame other commits than | ||
just the one closest in history for a bug. | ||
|
||
Lastly, the fix\_and\_bug\_introducing\_pairs.json includes all possible pairs | ||
which could lead to a bug introduction and fix. This file is not sorted in any | ||
way and it includes doublettes when it comes to both introducers and fixes. A | ||
fix can be made several times and a introducer could be responsible for many | ||
fixes. | ||
|
||
## Feature Extraction ## | ||
Now that the potential bug introducing commits has been identified, the | ||
repository can be mined for features. | ||
|
||
### Code Churns ### | ||
The most simple features are the code churns. These are easily extracted by | ||
just parsing each diff for each commit. The ones that are extracted are: | ||
|
||
1. **Total lines of code** - Which simply is how many lines of code in total | ||
for all changed files. | ||
2. **Churned lines of code** - This is how many lines that have been inserted. | ||
3. **Deleted lines of code** - The number of deleted lines. | ||
4. **Number of Files** - The total number of changed files. | ||
|
||
To get these features, run: `python assemble_code_churns.py <path_to_repo> | ||
<branch>` | ||
|
||
### Diffusion Features ### | ||
The diffusion features are: | ||
|
||
1. The number of modified subsystems. | ||
2. The number of modified subdirectories. | ||
3. The entropy of the change. | ||
|
||
To extract the diffusion features, just run: | ||
`python assemble_diffusion_features.py --repository <path_to_repo> --branch <branch>` | ||
|
||
### Experience Features ### | ||
Maybe the most uncomfortable feature group. The experience features are the | ||
features that measures how much experience a developer has, both how recent | ||
but also how much experience the developer has overall with the code. | ||
|
||
The features are: | ||
|
||
1. Overall experience. | ||
2. Recent experience. | ||
|
||
The script builds a graph to keep track of each authors experience. So the intial | ||
run is: | ||
`python assemble_experience_features.py --repository <repo_path> --branch <branch> --save-graph` | ||
|
||
This will result in a graph which the script could use for future analysis | ||
|
||
To rerun the analysis without generating a new graph, just run: | ||
`python assemble_experience_features.py --repository <repo_path> --branch <branch>` | ||
|
||
### History Features ### | ||
The history are as follows: | ||
|
||
1. The number of authors in a file. | ||
2. The time between contributions made by the author. | ||
3. The number of unique changes between the last commit. | ||
|
||
The same as with the experience features, the script must initially generate a graph | ||
where the file meta data is saved. | ||
`python assemble_history_features.py --repository <repo_path> --branch <branch> --save-graph` | ||
|
||
To rerun the script without generating a new graph, use: | ||
`python assemble_history_features.py --repository <repo_path> --branch <branch>` | ||
|
||
### Purpose Features ### | ||
The purpose feature is just a single feature and that is if the commit is a fix o | ||
not. To extract it use: | ||
|
||
`python assemble_purpose_features.py --repository <repo_path> --branch <branch>` | ||
|
||
### Coupling ### | ||
A more complex number of features are the coupling features. These indicates | ||
how strong the relation is between files and modules for a revision. This means | ||
that two files can have a realtion even though they don't have a realtion | ||
inside the source code itself. So by mining these, features that gives | ||
indications in how many files that a commit actually has made changes to are | ||
found. | ||
|
||
The mining is made by a docker image containing the tool code-maat. | ||
|
||
These features takes long time to extract but is mined using: | ||
|
||
```python | ||
python assemble_features.py --image code-maat --repo-dir <path_to_repo> --result-dir <path_to_write_result> | ||
python assemble_coupling_features.py <path_to_repo> | ||
``` | ||
|
||
It is also possible to specify which commits to analyze. This is done with the | ||
CLI option `--commits <path_to_file_with_commits>`. The format of this file is | ||
just lines where each line is equal to the corresponding commit SHA-1. | ||
|
||
If the analyzation is made by several docker containers, one has to specify | ||
the `--assemble` option which stands for assemble. This will collect and store | ||
all results in a single directory. | ||
|
||
The script is capable of checking if the are any commits that haven't been | ||
analyzed. To do that, specify the `--missing-commits` option. | ||
|
||
## Classification ## | ||
Now that data has been assembled the training and testing of the ML model can | ||
be made. To do this, simply run the model script in the model directory: | ||
```python | ||
python model.py train | ||
``` | ||
|
||
## Authors | ||
|
||
[Oscar Svensson](mailto:wgcp92@gmail.com) | ||
[Kristian Berg](mailto:kristianberg.jobb@gmail.com) |