# Where the Bugs are

Every time a bug is fixed, developers leave a trace – in the _version database_ when they commit the fix, or in the _bug database_ when they close the bug. In this chapter, we learn how to _mine these repositories_ for past changes and bugs, and how to _map_ them to individual modules and functions, highlighting those project components that have seen the most changes and fixes over time.

In [None]:
from bookutils import YouTubeVideo
# YouTubeVideo("w4u5gCgPlmg")

**Prerequisites**

* You should have read [the chapter on tracking bugs](Tracking.ipynb).

In [None]:
import bookutils

In [None]:
import Tracking

## Synopsis
<!-- Automatically generated. Do not edit. -->

To [use the code provided in this chapter](Importing.ipynb), write

```python
>>> from debuggingbook.ChangeExplorer import <identifier>
```

and then make use of the following features.


_For those only interested in using the code in this chapter (without wanting to know how it works), give an example.  This will be copied to the beginning of the chapter (before the first section) as text with rendered input and output._


![](PICS/ChangeExplorer-synopsis-1.svg)



## Mining Change Histories

The history of any software project is a history of change. Any nontrivial project thus comes with a _version database_ to organize and track changes; and possibly also with an [issue database](Tracking.ipynb) to organize and track issues.

Over time, these databases hold plenty of information about the project: _Who changed what, when, and why?_ This information can be _mined_ from existing databases and _analyzed_ to answer questions such as

* Which parts in my project were most frequently or recently changed?
* How many files does the average change touch?
* Where in my project were the most bugs fixed?

To answer such questions, we can _mine_ change and bug histories for past changes and fixes. This involves digging through version databases such as `git` and [issue trackers such as RedMine or Bugzilla](Tracking.ipynb) and extracting all their information. Fortunately for us, there is ready-made infrastructure for some of this. 
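As a rough illustration of the kind of analysis we are after, the following sketch answers the three questions above on a handful of made-up commit records. The record layout (a message plus a list of files) and all file names are assumptions for this example only; real data would come from a version database.

```python
from collections import Counter

# Hypothetical commit records: (commit message, list of files touched)
commits = [
    ("Initial import", ["README.md"]),
    ("New chapter", ["notebooks/Tracer.ipynb", "Chapters.makefile"]),
    ("Fix: typo", ["notebooks/Tracer.ipynb"]),
    ("Fix: broken link", ["README.md", "notebooks/Tracer.ipynb"]),
]

# Which files were changed most frequently?
changes = Counter(path for _, files in commits for path in files)

# How many files does the average change touch?
average_files = sum(len(files) for _, files in commits) / len(commits)

# Where were the most fixes applied?
fixes = Counter(path for msg, files in commits
                if msg.startswith("Fix:") for path in files)

print(changes.most_common(1))
print(average_files)
print(fixes.most_common(1))
```

The remainder of this chapter does essentially the same, but on real repositories and at a finer granularity.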

## Mining with PyDriller

[PyDriller](https://pydriller.readthedocs.io/) is a Python package for mining change histories. Its `RepositoryMining` class takes a `git` version repository and gives access to all the individual changes ("modifications"), together with committers, affected files, commit messages, and more.

In [None]:
from pydriller import RepositoryMining  # https://pydriller.readthedocs.io/

To use `RepositoryMining`, we need to pass it 
* the URL of a `git` repository; or
* the directory name where a cloned `git` repository can be found.

In general, cloning a `git` repository locally (with `git clone URL`) and then analyzing it locally will be faster and require less network resources.

Let us apply `RepositoryMining` on the repository of this book. The function `current_repo()` returns the directory in which a `.git` subdirectory is stored – that is, the root of a cloned `git` repository.

In [None]:
import os

In [None]:
def current_repo():
    """Return the root of the enclosing git repository, or None if there is none."""
    path = os.getcwd()
    while True:
        if os.path.exists(os.path.join(path, '.git')):
            return os.path.normpath(path)

        # Go one level up
        new_path = os.path.normpath(os.path.join(path, '..'))
        if new_path != path:
            path = new_path
        else:
            # Reached the file system root without finding `.git`
            return None

In [None]:
current_repo()

This gives us a repository miner for the book:

In [None]:
book_miner = RepositoryMining(current_repo())

`traverse_commits()` is a generator that returns one commit after another. Let us fetch the very first commit made to the book:

In [None]:
book_commits = book_miner.traverse_commits()
book_first_commit = next(book_commits)

Each commit has a number of attributes telling us more about the commit.

In [None]:
[attr for attr in dir(book_first_commit) if not attr.startswith('_')]

For instance, the `msg` attribute lets us know about the commit message:

In [None]:
book_first_commit.msg

whereas the `author` attribute gets us the name and email of the person who made the commit:

In [None]:
[attr for attr in dir(book_first_commit.author) if not attr.startswith('_')]

In [None]:
book_first_commit.author.name, book_first_commit.author.email

A commit consists of multiple _modifications_ to possibly multiple files. The commit `modifications` attribute returns a list of modifications.

In [None]:
book_first_commit.modifications

For each modification, we can retrieve the files involved as well as several statistics:

In [None]:
[attr for attr in dir(book_first_commit.modifications[0]) if not attr.startswith('_')]

Let us see which file was created with this modification:

In [None]:
book_first_commit.modifications[0].new_path

The `source_code` attribute holds the entire file contents after the modification.

In [None]:
print(book_first_commit.modifications[0].source_code)

We see that the `debuggingbook` project started with a very simple commit, namely the addition of an (almost empty) `README.md` file.

The attribute `source_code_before` holds the previous source code. We see that it is `None` – the file was just created.

In [None]:
print(book_first_commit.modifications[0].source_code_before)

Let us have a look at the _second_ commit. We see that it is much more substantial already.

In [None]:
book_second_commit = next(book_commits)

In [None]:
[m.new_path for m in book_second_commit.modifications]

We fetch the modification for the `README.md` file:

In [None]:
readme_modification = [m for m in book_second_commit.modifications if m.new_path == 'README.md'][0]

The `source_code_before` attribute holds the previous version (which we already have seen):

In [None]:
print(readme_modification.source_code_before)

The `source_code` attribute holds the new version – now a complete "README" file. (Compare this first version to the [current README text](index.ipynb).)

In [None]:
print(readme_modification.source_code[:400])

The `diff` attribute holds the differences between the old and the new version.

In [None]:
print(readme_modification.diff[:100])

The `diff_parsed` attribute even lists added and deleted lines:

In [None]:
readme_modification.diff_parsed['added'][:10]
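To get a feeling for what such a parsed diff contains, here is a minimal sketch that extracts added and deleted lines, together with their line numbers, from a unified diff. The function name `parse_unified_diff` and the sample diff are made up for this illustration, and the sketch assumes that the diff consists of hunks only (no `---`/`+++` file headers). It is not PyDriller's implementation.

```python
import re

# Hunk header, e.g. "@@ -1,3 +1,4 @@"
HUNK_HEADER = re.compile(r'^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@')

def parse_unified_diff(diff_text):
    """Return {'added': [(lineno, text)], 'deleted': [(lineno, text)]}."""
    parsed = {'added': [], 'deleted': []}
    old_lineno = new_lineno = 0

    for line in diff_text.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            # A hunk header resets both line counters
            old_lineno = int(header.group(1))
            new_lineno = int(header.group(2))
        elif line.startswith('+'):
            parsed['added'].append((new_lineno, line[1:]))
            new_lineno += 1
        elif line.startswith('-'):
            parsed['deleted'].append((old_lineno, line[1:]))
            old_lineno += 1
        else:
            # Context line: present in both versions
            old_lineno += 1
            new_lineno += 1

    return parsed

sample_diff = """@@ -1,3 +1,4 @@
 # Title
-Old text
+New text
+More text
 Footer"""

print(parse_unified_diff(sample_diff))
```

Note how added lines are numbered relative to the new version, whereas deleted lines are numbered relative to the old version.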

With all this information, we can track all commits and modifications and establish statistics over which files were changed (and possibly even fixed) most. This is what we will do in the next section.

## Counting Changes

We start with a simple `ChangeCounter` class that, given a repository, counts for each file how frequently it was changed.

The constructor takes the repository to be analyzed and sets the internal counters:

In [None]:
class ChangeCounter:
    """Count the number of changes for a repository."""
    
    def __init__(self, repo, filter=None, log=False, **kwargs):
        """Constructor. `repo` is a git repository (as URL or directory).
`filter` is a predicate that takes a modification and returns True 
  if it should be considered (default: consider all).
`log` turns on logging if set.
`kwargs` are passed to the `RepositoryMining()` constructor."""
        self.repo = repo
        self.log = log

        if filter is None:
            filter = lambda m: True
        self.filter = filter

        # A node is a tuple (f_1, f_2, f_3, ..., f_n) denoting
        # a folder f_1 holding a folder f_2 ... holding a file f_n
        self.changes = {}    # Mapping node -> #of changes
        self.messages = {}   # Mapping node -> list of commit messages
        self.sizes = {}      # Mapping node -> last size seen
        self.hashes = set()  # All hashes already considered

        self.mine(**kwargs)

The method `mine()` does all the heavy lifting of mining. It retrieves all commits and all modifications from the repository, passing the modifications through the `update_stats()` method.

In [None]:
class ChangeCounter(ChangeCounter):
    def mine(self, **kwargs):
        """Gather data from repository. To be extended in subclasses."""
        miner = RepositoryMining(self.repo, **kwargs)

        for commit in miner.traverse_commits():
            for m in commit.modifications:
                m.hash = commit.hash
                m.committer = commit.committer
                m.committer_date = commit.committer_date
                m.msg = commit.msg

                if self.include(m):
                    self.update_stats(m)

The `include()` method allows us to filter modifications. For simplicity, `mine()` copies the most relevant attributes of the commit over to the modification, such that the filter can access them, too.

In [None]:
class ChangeCounter(ChangeCounter):
    def include(self, m):
        """Return True if the modification `m` should be included
(default: the `filter` predicate given to the constructor).
To be overloaded in subclasses."""
        return self.filter(m)
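Since a filter is just a predicate over modifications, filters are easy to write and to combine. The following sketch uses `SimpleNamespace` objects as stand-ins for modifications; the predicates `python_only()` and `fixes_only()` and the combinator `both()` are hypothetical names introduced for this example.

```python
from types import SimpleNamespace

def python_only(m):
    """Hypothetical filter: consider only modifications to Python files."""
    return bool(m.new_path) and m.new_path.endswith('.py')

def fixes_only(m):
    """Hypothetical filter: consider only commits marked as fixes."""
    return m.msg.startswith('Fix:')

def both(*predicates):
    """Combine predicates: a modification must satisfy all of them."""
    return lambda m: all(p(m) for p in predicates)

# Stand-ins for modifications, carrying the attributes copied in `mine()`
mods = [
    SimpleNamespace(new_path='code/Tracer.py', msg='Fix: off-by-one'),
    SimpleNamespace(new_path='README.md', msg='Fix: typo'),
    SimpleNamespace(new_path=None, msg='Removed old file'),
]

selected = [m.new_path for m in mods if both(python_only, fixes_only)(m)]
print(selected)
```

Such a predicate could be passed as the `filter` argument of the `ChangeCounter` constructor.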

The `update_stats()` method is the method that does the counting. It takes a modification and converts the file name into a _node_ – a tuple $(f_1, f_2, ..., f_n)$ that denotes a _hierarchy_: Each $f_i$ is a directory holding $f_{i+1}$, with $f_n$ being the actual file. Here is what this notebook looks like as a node:

In [None]:
tuple('debuggingbook/notebooks/ChangeExplorer.ipynb'.split('/'))
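The hierarchy encoded in a node can be made explicit by listing its prefixes, each of which denotes an enclosing directory. The helpers `path_to_node()` and `ancestors()` below are hypothetical names for this sketch; they are not part of the classes in this chapter.

```python
from pathlib import PurePosixPath

def path_to_node(path):
    """Convert a '/'-separated repository path into a node tuple."""
    return tuple(path.split('/'))

def ancestors(node):
    """Return all enclosing directories of `node`, outermost first."""
    return [node[:i] for i in range(1, len(node))]

node = path_to_node('debuggingbook/notebooks/ChangeExplorer.ipynb')
print(node)
print(ancestors(node))

# `PurePosixPath.parts` yields the same tuple for relative paths
assert node == PurePosixPath('debuggingbook/notebooks/ChangeExplorer.ipynb').parts
```

Representing nodes as tuples makes them hashable, so they can serve directly as dictionary keys.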

For each such node, `update_stats()` then invokes `update_size()`, `update_changes()`, and `update_elems()`.

In [None]:
class ChangeCounter(ChangeCounter):
    def update_stats(self, m):
        """Update counters with modification `m`. Can be extended in subclasses."""
        if not m.new_path:
            return

        node = tuple(m.new_path.split('/'))

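        # Count each (commit, file) pair only once. Checking the commit hash
        # alone would skip all but the first file changed in a commit.
        if (m.hash, node) not in self.hashes:
            self.hashes.add((m.hash, node))
            self.update_size(node, len(m.source_code) if m.source_code else 0)
            self.update_changes(node, m.msg)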

        self.update_elems(node, m)

`update_size()` simply saves the last size of the item being modified. Since we progress from first to last commit, this reflects the size of the newest version.

In [None]:
class ChangeCounter(ChangeCounter):
    def update_size(self, node, size):
        """Update counters for `node` with `size`. Can be extended in subclasses."""
        self.sizes[node] = size

`update_changes()` increases the counter in `changes` for the given node `node`, and adds the current commit message `commit_msg` to the node's list in `messages`. Together with `update_size()`, this makes

* `sizes` a mapping of nodes to their size
* `changes` a mapping of nodes to the number of changes they have seen
* `messages` a mapping of nodes to the list of commit messages that have affected them.

In [None]:
class ChangeCounter(ChangeCounter):
    def update_changes(self, node, commit_msg):
        """Update stats for `node` changed with `commit_msg`.
Can be extended in subclasses."""
        self.changes.setdefault(node, 0)
        self.changes[node] += 1

        self.messages.setdefault(node, [])
        self.messages[node].append(commit_msg)

The `update_elems()` method is reserved for later use, when we go and count fine-grained changes.

In [None]:
class ChangeCounter(ChangeCounter):
    def update_elems(self, node, m):
        """Update counters for subelements of `node` with modification `m`.
To be defined in subclasses."""
        pass

Let us put `ChangeCounter` into action – on the current (debuggingbook) repository.

In [None]:
DEBUGGINGBOOK_REPO = current_repo()

In [None]:
DEBUGGINGBOOK_REPO

You can also specify a URL instead, but this will access the repository via the network and generally be much slower.

In [None]:
# DEBUGGINGBOOK_REPO = 'https://github.com/uds-se/debuggingbook.git'

The function `debuggingbook_change_counter` instantiates a `ChangeCounter` class (or any subclass) with the debuggingbook repository, mining all the counters as listed above.

In [None]:
def debuggingbook_change_counter(cls):
    """Instantiate a ChangeCounter (sub)class `cls` with the debuggingbook repo"""
    
    def filter(m):
        """Do not include the `docs/` directory; it only holds Web pages"""
        return m.new_path and not m.new_path.startswith('docs/')

    return cls(DEBUGGINGBOOK_REPO, filter=filter)

Let us set `change_counter` to this `ChangeCounter` instance. This can take a few minutes.

In [None]:
from Timer import Timer

In [None]:
with Timer() as t:
    change_counter = debuggingbook_change_counter(ChangeCounter)

t.elapsed_time()

The attribute `changes` of our `ChangeCounter` now is a mapping of nodes to the respective number of changes. Here are the first 10 entries:

In [None]:
list(change_counter.changes.keys())[:10]

This is the number of changes to the `Chapters.makefile` file which lists the book chapters:

In [None]:
change_counter.changes[('Chapters.makefile',)]

The `messages` attribute holds all the messages:

In [None]:
change_counter.messages[('Chapters.makefile',)]

In [None]:
for node in change_counter.changes:
    assert len(change_counter.messages[node]) == change_counter.changes[node]

The `sizes` attribute holds the final size:

In [None]:
change_counter.sizes[('Chapters.makefile',)]

## Visualizing Past Changes

To explore the number of changes across all project files, we visualize them as a _tree map_. A tree map visualizes hierarchical data using nested rectangles. In our visualization, each directory is shown as a rectangle containing smaller rectangles for its files and subdirectories. The _size_ of a rectangle reflects the size of the file (in bytes); the _color_ of a rectangle reflects the number of changes the file has seen.

We use the [easyplotly](https://github.com/mwouts/easyplotly) package to easily create a treemap.

In [None]:
import easyplotly as ep
import plotly.graph_objects as go

In [None]:
import math

The method `map_node_sizes()` returns a size for the node – any number will do. By default, we use a logarithmic scale, such that smaller files are not totally visually eclipsed by larger files.

In [None]:
class ChangeCounter(ChangeCounter):
    def map_node_sizes(self):
        """Return a mapping of nodes to sizes. Can be overloaded in subclasses."""
        # Default: use log scale
        return {node: math.log(self.sizes[node]) if self.sizes[node] else 0
                for node in self.sizes}

        # Alternative: use sqrt size
        # return {node: math.sqrt(self.sizes[node]) for node in self.sizes}

        # Alternative: use absolute size
        # return self.sizes
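To see why the scale matters, the following sketch compares the three alternatives on two made-up file sizes. The sizes and the helper `ratio()` are assumptions for this example.

```python
import math

# Hypothetical file sizes in bytes
sizes = {('README.md',): 1_000, ('notebooks', 'Big.ipynb'): 1_000_000}

log_sizes = {node: math.log(size) if size else 0 for node, size in sizes.items()}
sqrt_sizes = {node: math.sqrt(size) for node, size in sizes.items()}

def ratio(mapping):
    """How many times larger the big file appears than the small one."""
    return mapping[('notebooks', 'Big.ipynb')] / mapping[('README.md',)]

print(f"absolute: {ratio(sizes):.1f}x")
print(f"sqrt:     {ratio(sqrt_sizes):.1f}x")
print(f"log:      {ratio(log_sizes):.1f}x")
```

With absolute sizes, the larger file would take a thousand times the area of the smaller one; on a logarithmic scale, it takes only about twice the area.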

The method `map_node_color()` returns a color for the node – again, as a number. The smallest and largest numbers returned indicate beginning and end in the given color scale, respectively.

In [None]:
class ChangeCounter(ChangeCounter):
    def map_node_color(self, node):
        """Return a color of the node, as a number. Can be overloaded in subclasses."""
        if node and node in self.changes:
            return self.changes[node]
        return None

The method `map_node_text()` shows a text to be displayed in the rectangle; we set this to the number of changes.

In [None]:
class ChangeCounter(ChangeCounter):
    def map_node_text(self, node):
        """Return the text to be shown for the node (default: #changes). 
Can be overloaded in subclasses."""
        if node and node in self.changes:
            return self.changes[node]
        return None

The methods `map_hoverinfo()` and `map_colorscale()` set additional map parameters. For details, see the [easyplotly](https://github.com/mwouts/easyplotly) documentation.

In [None]:
class ChangeCounter(ChangeCounter):
    def map_hoverinfo(self):
        """Return the text to be shown when hovering over a node."""
        return 'label+text'

    def map_colorscale(self):
        """Return the colorscale for the map."""
        return 'YlOrRd'

With all this, the `map()` method creates a tree map of the repository, using the [easyplotly](https://github.com/mwouts/easyplotly) `Treemap` constructor.

In [None]:
class ChangeCounter(ChangeCounter):
    def map(self):
        """Produce an interactive tree map of the repository."""
        treemap = ep.Treemap(
                     self.map_node_sizes(),
                     text=self.map_node_text,
                     hoverinfo=self.map_hoverinfo(),
                     marker_colors=self.map_node_color,
                     marker_colorscale=self.map_colorscale(),
                     root_label=self.repo,
                     branchvalues='total'
                    )

        fig = go.Figure(treemap)
        fig.update_layout(margin=dict(l=0, r=0, t=30, b=0))

        return fig

This is what the tree map for `debuggingbook` looks like. 

* Click on any rectangle to enlarge it.
* Click outside of the rectangle to return to a wider view.
* Hover over a rectangle to get further information.

In [None]:
change_counter = debuggingbook_change_counter(ChangeCounter)

In [None]:
change_counter.map()

We can easily identify the most frequently changed files:

In [None]:
all_nodes = list(change_counter.changes.keys())
all_nodes.sort(key=lambda node: change_counter.changes[node], reverse=True)
[(node, change_counter.changes[node]) for node in all_nodes[:4]]
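If only the top few entries are of interest, sorting the entire list is not required. As a sketch, `heapq.nlargest()` from the standard library selects them directly; the `changes` mapping below is made up, but shaped like the `changes` attribute of a `ChangeCounter`.

```python
import heapq

# Hypothetical change counts, shaped like `ChangeCounter.changes`
changes = {
    ('README.md',): 12,
    ('notebooks', 'Tracer.ipynb'): 40,
    ('notebooks', 'Slicer.ipynb'): 55,
    ('Chapters.makefile',): 7,
}

# Select the two nodes with the highest number of changes
top_two = heapq.nlargest(2, changes.items(), key=lambda item: item[1])
print(top_two)
```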

In [None]:
# ignore
all_notebooks = [node for node in change_counter.changes.keys()
                 if len(node) == 2 and node[1].endswith('.ipynb')]
all_notebooks.sort(key=lambda node: change_counter.changes[node], reverse=True)

In [None]:
from bookutils import quiz

In [None]:
quiz("Which two notebooks have seen the most changes over time?",
    [
        f"`{all_notebooks[3][1].split('.')[0]}`",
        f"`{all_notebooks[1][1].split('.')[0]}`",
        f"`{all_notebooks[2][1].split('.')[0]}`",
        f"`{all_notebooks[0][1].split('.')[0]}`",
    ], [1234 % 3, 3702 / 1234])

Indeed, these two are the two most frequently changed notebooks:

In [None]:
all_notebooks[0][1].split('.')[0], all_notebooks[1][1].split('.')[0]

## Past Fixes

Knowing which files have been changed most is useful in debugging, because any change increases the chance of introducing a new bug. Even more important, however, is the question of how frequently a file was _fixed_ in the past, as this is an important indicator of its bug-proneness.

(One may think that fixing several bugs _reduces_ the number of bugs, but unfortunately, a file which has seen several fixes in the past is likely to see fixes in the future, too. This is because the bug-proneness of a software component very much depends on the requirements it has to fulfill, and if these requirements are unclear, complex, or frequently change, this translates into many fixes.)

How can we tell _changes_ from _fixes_? 

* One indicator is _commit messages_:
  If they refer to "bugs" or "fixes", then the change is a fix.
* Another indicator is _bug numbers_:
  If a commit message contains an issue number from an associated issue database, then we can make use of the issue referred to.
    * The issue database may provide us with additional information about the bug, such as its severity, how many people it was assigned to, how long it took to fix it, and more.
* A final indicator is _time_:
  If a developer first committed a change and in the same time frame marked an issue as "resolved", then it is likely that the two refer to each other.

How commits and issues can be linked very much depends on the project, and on the discipline of developers when it comes to commit messages. _Branches_ and _merges_ bring additional challenges.
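The first two indicators can be approximated with simple pattern matching on commit messages. The following sketch assumes that fixes mention words such as "fix" or "bug", and that issues are referenced as `#` followed by a number; both conventions, as well as the function `classify()`, are assumptions that will differ between projects.

```python
import re

# Assumed conventions; real projects differ
FIX_KEYWORDS = re.compile(r'\b(fix(es|ed)?|bugs?)\b', re.IGNORECASE)
ISSUE_NUMBER = re.compile(r'#(\d+)')

def classify(msg):
    """Return (looks_like_fix, referenced_issue_numbers) for a commit message."""
    issues = [int(n) for n in ISSUE_NUMBER.findall(msg)]
    looks_like_fix = bool(FIX_KEYWORDS.search(msg)) or bool(issues)
    return looks_like_fix, issues

for msg in ["Fix: crash on empty input",
            "Closes #42 and #57",
            "New chapter on tracing"]:
    print(msg, '->', classify(msg))
```

Such heuristics produce both false positives (an issue number may refer to a feature request) and false negatives (a fix may not be labeled as such), so their results should be checked against a sample of commits.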

For the `debuggingbook` project, identifying fixes is easy. The discipline is that if a change fixes a bug, it is prefixed with `Fix:`. We can use this to introduce a `FixCounter` class specific to our `debuggingbook` project.

In [None]:
class FixCounter(ChangeCounter):
    def include(self, m):
        """Include all modifications whose commit messages start with 'Fix:'"""
        return super().include(m) and m and m.msg.startswith("Fix:")

As a twist to our default `ChangeCounter` class, we include the "fix" messages in the tree map rectangles.

In [None]:
class FixCounter(FixCounter):
    def map_node_text(self, node):
        if node and node in self.messages:
            return "<br>".join(self.messages[node])
        return ""

    def map_hoverinfo(self):
        return 'label'

This is the tree map showing fixes. We see that 
* only those components that actually have seen a fix are shown; and
* the fix distribution differs from the change distribution.

In [None]:
fix_counter = debuggingbook_change_counter(FixCounter)

In [None]:
fix_counter.map()

## Fine-Grained Changes

Our aim is to obtain nodes that go beyond just files, reaching down to individual elements such as functions and classes.

* For each file, get a listing of which elements are in which lines.
* Then, for each diff, find out which elements are affected.

### Mapping Elements to Locations

In [None]:
import re

In [None]:
import magic  # https://github.com/ahupp/python-magic

In [None]:
magic.from_buffer(r'''
#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("Hello, world!\n");
}
''')

Also see http://rigaux.org/language-study/syntax-across-languages.html for a comparison of syntax across languages.

In [None]:
DELIMITERS = [
    (
        # Python
        re.compile(r'^python.*'),

        # Beginning of element
        re.compile(r'^(async\s+)?(def|class)\s+(?P<name>\w+)\W.*'),

        # End of element
        re.compile(r'^[^#\s]')
    ),
    (
        # Jupyter Notebooks
        re.compile(r'^(json|exported sgml|jupyter).*'),
        re.compile(r'^\s+"(async\s+)?(def|class)\s+(?P<name>\w+)\W'),
        re.compile(r'^(\s+"[^#\s\\]|\s+\])')
    ),
    (
        # C source code (actually, any { }-delimited language)
        re.compile(r'^(c |c\+\+|c#|java|perl|php).*'),
        re.compile(r'^[^\s].*\s+(?P<name>\w+)\s*[({].*'),
        re.compile(r'^[}]')
    )
]

In [None]:
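def rxdelim(s):
    """Return the (begin, end) regular expressions delimiting elements
in source `s`, or (None, None) if the file type is unknown."""
    tp = magic.from_buffer(s).lower()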
    for rxtp, rxbegin, rxend in DELIMITERS:
        if rxtp.match(tp):
            return rxbegin, rxend

    return None, None

In [None]:
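def elem_mapping(s, log=False):
    """Return a list mapping each line number of `s` (starting with 1)
to the name of the element it belongs to, or None. Index 0 is unused.
Return an empty list if the file type is unknown."""
    rxbegin, rxend = rxdelim(s)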
    if rxbegin is None:
        return []

    mapping = [None]
    current_elem = None
    lineno = 0

    for line in s.split('\n'):
        lineno += 1

        match = rxbegin.match(line)
        if match:
            current_elem = match.group('name')
        elif rxend.match(line):
            current_elem = None

        mapping.append(current_elem)

        if log:
            print(f"{lineno:3} {current_elem}\t{line}")

    return mapping

In [None]:
some_c_source = """
#include <stdio.h>

int foo(int x) {
    return x;
}

struct bar {
    int x, y;
}

int main(int argc, char *argv[]) {
    return foo(argc);
}

"""
some_c_mapping = elem_mapping(some_c_source, log=True)

In [None]:
some_python_source = """
def foo(x):
    return x

class bar(blue):
    x = 25
    def f(x):
        return 26

def main(argc):
    return foo(argc)

"""
some_python_mapping = elem_mapping(some_python_source, log=True)
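To make the mapping concrete without depending on file type detection, here is a simplified stand-alone sketch that hardcodes the Python delimiters from `DELIMITERS`. The function name `python_elem_mapping()` and the sample source are introduced for this illustration only.

```python
import re

# Same patterns as the Python entry in DELIMITERS
PY_BEGIN = re.compile(r'^(async\s+)?(def|class)\s+(?P<name>\w+)\W.*')
PY_END = re.compile(r'^[^#\s]')

def python_elem_mapping(source):
    """Map line numbers (1-based) to the enclosing top-level element.
    Index 0 is unused and holds None."""
    mapping = [None]
    current_elem = None
    for line in source.split('\n'):
        match = PY_BEGIN.match(line)
        if match:
            # A top-level `def` or `class` starts a new element
            current_elem = match.group('name')
        elif PY_END.match(line):
            # Any other unindented, non-comment line ends the element
            current_elem = None
        mapping.append(current_elem)
    return mapping

source = """def foo(x):
    return x

VERSION = 1

class Bar:
    def f(self):
        return 26
"""
mapping = python_elem_mapping(source)
print(mapping)
```

Two properties of this approach are worth noting: blank lines following an element are still attributed to it, and nested definitions (such as `f()` above) are attributed to their enclosing top-level element.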

In [None]:
# some_jupyter_source = open("Slicer.ipynb").read()
# some_jupyter_mapping = elem_mapping(some_jupyter_source, log=False)

### Determining Changed Elements

In [None]:
from ChangeDebugger import diff  # minor dependency

In [None]:
from diff_match_patch import diff_match_patch

In [None]:
class FineChangeCounter(ChangeCounter):
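    def changed_elems(self, mapping, start, length=0):
        """Return the set of elements in `mapping` found in
lines `start` to `start + length` (inclusive)."""
        elems = set()
        for line in range(start, start + length + 1):
            if line < len(mapping) and mapping[line]:
                elems.add(mapping[line])

        return elems

    def elem_size(self, elem, mapping, source):
        """Return the size of `elem` in `source`, as the total
number of characters in the lines that `mapping` assigns to it."""
        source_lines = [''] + source.split('\n')
        size = 0

        for line_no in range(len(mapping)):
            if mapping[line_no] == elem:
                size += len(source_lines[line_no])

        return size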

In [None]:
fine_change_counter = debuggingbook_change_counter(FineChangeCounter)

In [None]:
assert fine_change_counter.changed_elems(some_python_mapping, 4) == {'foo'}

In [None]:
assert fine_change_counter.changed_elems(some_python_mapping, 4, 1) == {'foo', 'bar'}

In [None]:
assert fine_change_counter.changed_elems(some_python_mapping, 10, 2) == {'main'}

In [None]:
class FineChangeCounter(FineChangeCounter):
    def update_elems(self, node, m):
        old_source = m.source_code_before if m.source_code_before else ""
        new_source = m.source_code if m.source_code else ""
        patches = diff(old_source, new_source)

        old_mapping = elem_mapping(old_source)
        new_mapping = elem_mapping(new_source)

        elems = set()

        for patch in patches:
            old_start_line = patch.start1 + 1
            new_start_line = patch.start2 + 1

            for (op, data) in patch.diffs:
                data_length = data.count('\n')

                if op == diff_match_patch.DIFF_INSERT:
                    elems |= self.changed_elems(old_mapping, old_start_line)
                    elems |= self.changed_elems(new_mapping, new_start_line,
                                                 data_length)
                    # Inserted lines exist only in the new version
                    new_start_line += data_length
                elif op == diff_match_patch.DIFF_DELETE:
                    elems |= self.changed_elems(old_mapping, old_start_line,
                                                 data_length)
                    elems |= self.changed_elems(new_mapping, new_start_line)
                    # Deleted lines exist only in the old version
                    old_start_line += data_length
                else:
                    # Unchanged lines exist in both versions
                    old_start_line += data_length
                    new_start_line += data_length

        for elem in elems:
            elem_node = node + (elem,)

            self.update_size(elem_node,
                             self.elem_size(elem, new_mapping, new_source))
            self.update_changes(elem_node, m.msg)
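The bookkeeping of old and new line numbers can also be illustrated with the standard library. The following sketch uses `difflib.SequenceMatcher` instead of the `diff()` function used above (a deliberate substitution to keep the example self-contained) and collects the changed line numbers in both versions; the two sample sources are made up.

```python
import difflib

old = ["def foo(x):", "    return x", "", "def main():", "    pass"]
new = ["def foo(x):", "    y = x", "    return y", "", "def main():", "    pass",
       "    # done"]

matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)

old_changed, new_changed = set(), set()
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == 'equal':
        continue  # Unchanged lines advance both sides, but affect nothing
    # old[i1:i2] was replaced by new[j1:j2]; convert to 1-based line numbers
    old_changed |= set(range(i1 + 1, i2 + 1))
    new_changed |= set(range(j1 + 1, j2 + 1))

print(sorted(old_changed), sorted(new_changed))
```

Looking up these line numbers in the element mappings of the old and the new version, respectively, yields the set of affected elements. Note that an insertion only consumes lines of the new version, whereas a deletion only consumes lines of the old version.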

In [None]:
with Timer() as t:
    fine_change_counter = debuggingbook_change_counter(FineChangeCounter)

t.elapsed_time()

In [None]:
fine_change_counter.map()

## Synopsis

_For those only interested in using the code in this chapter (without wanting to know how it works), give an example.  This will be copied to the beginning of the chapter (before the first section) as text with rendered input and output._

In [None]:
# ignore
from ClassDiagram import display_class_hierarchy

In [None]:
# ignore
display_class_hierarchy([FineChangeCounter, FixCounter],
                        public_methods=[
                            ChangeCounter.__init__,
                            ChangeCounter.map,
                            ChangeCounter.include,
                            ChangeCounter.map_hoverinfo,
                            ChangeCounter.map_colorscale,
                            ChangeCounter.map_node_sizes,
                            ChangeCounter.map_node_text,
                            ChangeCounter.update_elems,
                            ChangeCounter.update_size,
                            ChangeCounter.update_changes,
                            FineChangeCounter.include,
                            FixCounter.include
                        ],
                        project='debuggingbook')
