Merge pull request #3 from vsoch/add/metrics
adding first example of metrics extraction
vsoch committed Dec 19, 2020
2 parents 4f58add + ba799f4 commit f378b7c
Showing 5 changed files with 316 additions and 13 deletions.
78 changes: 75 additions & 3 deletions README.md
@@ -13,6 +13,7 @@ Caliper is a tool for measuring and assessing change in packages.
- **Manager** a handle to interact with a package manager
- **Extractor** a controller to use a manager to extract metrics of interest
- **Version repository** a repository created by an extractor that has tagged commits for package releases
- **Metrics** classes that extract a measurement for a single timepoint, or a change between timepoints (e.g., lines changed)

### Managers

@@ -76,8 +77,14 @@ This is the [metrics extractor](#metrics-extractor) discussed next.
### Metrics Extractor

Finally, a metrics extractor provides an easy interface to iterate over versions
of a package, and extract some kind of metric. For example, let's say we have
the Pypi manager above:
of a package, and extract some kind of metric. There are two ways to go about it -
starting with a repository that already has tags of interest, or starting
with a manager that will be used to create it.

#### Extraction Using Manager

The manager knows all the files for a release of some particular software, so
we can use it to start an extraction. For example, let's say we have the Pypi manager above:

```python
from caliper.managers import PypiManager
```

@@ -154,7 +161,72 @@ $ git tag

This is really neat! Next we can use the extractor to calculate metrics.

**under development**

#### Extraction from Existing

As an alternative, if you have already created a repository via a manager (or have
another repository you want to use that doesn't require one), you can simply provide
its working directory to the metrics extractor:

```python
from caliper.metrics import MetricsExtractor
extractor = MetricsExtractor(working_dir="/tmp/sregistry-j63wuvei")
```

You can see that we've created a git manager at this root:

```python
extractor.git
<caliper.managers.git.GitManager at 0x7ff92a66ca60>
```

And we then might want to see what metrics are available for extraction.

```python
extractor.metrics
{'changedlines': 'caliper.metrics.collection.changedlines.metric.Changedlines'}
```

Without going into detail, there are two base classes of metrics: a `MetricBase`
extracts a metric for one timepoint (a tag or commit), and a `ChangeMetricBase`
extracts a metric that compares two of these timepoints. The metric shown above
is a change metric. We can then run the extraction:

```python
extractor.extract_metric("changedlines")
```

Note that you can also extract all metrics known to the extractor.

```python
extractor.extract_all()
```
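
To make the difference between the two kinds of metric concrete, here is a minimal, self-contained sketch. It does not use caliper: the classes `FileCount` and `FileCountChange` and the `snapshots` data are hypothetical stand-ins, and each "timepoint" is simply a list of file names rather than a git tag. The change metric keys its results as `previous..current`, mirroring the `v1..v2` lookups described in this README.

```python
# Hypothetical stand-ins (not caliper classes): each timepoint is a list of
# file names for a tagged version, in release order.
snapshots = {
    "0.1.0": ["a.py", "b.py"],
    "0.2.0": ["a.py", "b.py", "c.py"],
    "0.3.0": ["a.py", "c.py"],
}


class FileCount:
    """Single-timepoint metric: one value per tag."""

    def extract(self, snapshots):
        return {tag: len(files) for tag, files in snapshots.items()}


class FileCountChange:
    """Change metric: one value per pair of consecutive tags."""

    def extract(self, snapshots):
        tags = list(snapshots)
        results = {}
        for previous, current in zip(tags, tags[1:]):
            before = set(snapshots[previous])
            after = set(snapshots[current])
            results["%s..%s" % (previous, current)] = {
                "added": len(after - before),
                "removed": len(before - after),
            }
        return results


single = FileCount().extract(snapshots)
change = FileCountChange().extract(snapshots)
print(single)
print(change)
```

A single-timepoint metric yields one entry per version, while a change metric yields one entry per pair of versions, so it always has one fewer entry than there are timepoints in this sketch.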

#### Parsing Results

You can iterate over the extractor to loop through the metrics that have been
extracted, and retrieve either data at the level of individual files, or summary results:

```
for name, metric in extractor:
    # Changedlines <caliper.metrics.collection.changedlines.metric.Changedlines at 0x7f7cd24f4940>

    # A lookup of v1..v2 with a list of files
    metric.get_file_results()

    # A lookup of v1..v2 for summed changes
    metric.get_summed_results()
```

For example, an entry in summed results might look like this:

```
{'0.2.34..0.2.35': {'size': 0, 'insertions': 4, 'deletions': 4, 'lines': 8}}
```

This says that between versions 0.2.34 and 0.2.35 there were 4 insertions, 4 deletions,
and 8 lines changed in total, with no change in overall size.
We will eventually have more examples of how to parse and use this data.
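
As one possible way to use the summed results, the sketch below ranks version ranges by lines changed and totals a single key across all ranges. It works on a plain dictionary in the shape shown above, so it runs without caliper. The helper names `rank_by_lines` and `total` are hypothetical, and the second entry of the sample data is made up for illustration.

```python
# Sample data in the shape returned by get_summed_results(). Only the first
# entry comes from the README; the second is invented for illustration.
summed = {
    "0.2.34..0.2.35": {"size": 0, "insertions": 4, "deletions": 4, "lines": 8},
    "0.2.35..0.2.36": {"size": 120, "insertions": 10, "deletions": 2, "lines": 12},
}


def rank_by_lines(summed):
    """Return version ranges sorted from most to fewest lines changed."""
    return sorted(summed, key=lambda index: summed[index]["lines"], reverse=True)


def total(summed, key):
    """Sum one summary key (e.g., insertions) across all version ranges."""
    return sum(entry.get(key, 0) for entry in summed.values())


ranked = rank_by_lines(summed)
total_insertions = total(summed, "insertions")
print(ranked)
print(total_insertions)
```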


## Use Cases
54 changes: 44 additions & 10 deletions caliper/metrics.py → caliper/metrics/__init__.py
@@ -2,10 +2,12 @@
__copyright__ = "Copyright 2020-2021, Vanessa Sochat"
__license__ = "MPL 2.0"

from caliper.managers.base import ManagerBase
from caliper.metrics.base import MetricFinder
from caliper.managers import GitManager
from caliper.utils.command import wget_and_extract
from caliper.logger import logger

import importlib
import tempfile
import os

@@ -24,19 +26,57 @@ class MetricsExtractor:
The source should be a url we can download with wget or similar.
"""

    def __init__(self, manager):
    def __init__(self, manager=None, working_dir=None):
        self._metrics = {}
        self._extractors = {}
        self.manager = manager
        self.tmpdir = None
        self.git = None
        if not isinstance(self.manager, ManagerBase):
            raise ValueError("You must provide a caliper.manager subclass.")

        # If we have a working directory provided, the repository exists
        if working_dir:
            self.tmpdir = working_dir
            self.git = GitManager(self.tmpdir)

    def __iter__(self):
        for name, result in self._extractors.items():
            yield name, result

    @property
    def metrics(self):
        """return a list of metrics available"""
        if not self._metrics:
            self._metrics_finder = MetricFinder()
            self._metrics = dict(self._metrics_finder.items())
        return self._metrics

    def extract_all(self):
        for name in self.metrics:
            self.extract_metric(name)

    def extract_metric(self, name):
        """Given a metric, extract for each commit from the repository."""
        if name not in self.metrics:
            logger.exit("Metric %s is not known." % name)

        # If no git repository defined, prepare one
        if not self.git:
            self.prepare_repository()

        module, metric_name = self._metrics[name].rsplit(".", 1)
        metric = getattr(importlib.import_module(module), metric_name)()
        metric.extract(self.git)
        self._extractors[metric_name] = metric

    def prepare_repository(self):
        """Since most source code archives won't include the git history,
        we would want to create a root directory with a new git installation,
        and then create tagged commits that correspond to each version. We
        can then use this git repository to derive metrics of change.
        """
        if not self.manager:
            logger.exit("A manager is required to prepare a repository.")

        # Create temporary git directory
        self.tmpdir = tempfile.mkdtemp(prefix="%s-" % self.manager.name)
        self.git = GitManager(self.tmpdir)
@@ -66,9 +106,3 @@ def prepare_repository(self):
# number of changed lines
# number of changed files
# new dependencies


def changed_lines(before, after):
    """given a file before and after, count the number of changed lines"""
    # TODO, should be able to do this with git?
    pass
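
The `extract_metric` method in this file turns a dotted path string into a class by splitting off the last component and importing the rest as a module. A minimal sketch of that pattern in isolation follows. The helper name `load_class` is hypothetical, and a standard library class stands in for a metric class so the example runs without caliper installed.

```python
import importlib


def load_class(dotted_path):
    """Import the module part of a dotted path and return the named attribute."""
    module_name, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# A standard library class stands in for a metric class here.
cls = load_class("collections.OrderedDict")
instance = cls()
instance["changedlines"] = 1
print(type(instance).__name__)
```

Storing only the path string means a metric module is not imported until it is actually requested, which keeps listing the available metrics cheap.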
97 changes: 97 additions & 0 deletions caliper/metrics/base.py
@@ -0,0 +1,97 @@
__author__ = "Vanessa Sochat"
__copyright__ = "Copyright 2020-2021, Vanessa Sochat"
__license__ = "MPL 2.0"

from abc import abstractmethod
from collections.abc import Mapping
from caliper.logger import logger
import os

here = os.path.abspath(os.path.dirname(__file__))


class MetricBase:
    name = "metric"
    description = "Extract a metric for a particular tag or commit"

    @abstractmethod
    def extract(self, git):
        pass

    @abstractmethod
    def _extract(self, git, commit):
        pass

    @abstractmethod
    def get_file_results(self):
        pass

    @abstractmethod
    def get_summed_results(self):
        pass


class ChangeMetricBase(MetricBase):

    name = "changemetric"
    description = "Extract a metric between two tags or commits"

    @abstractmethod
    def _extract(self, git, commit1, commit2):
        pass


class MetricFinder(Mapping):
    """This is a metric cache (inspired by spack packages) that will keep
    a cache of all installed metrics under caliper/metrics/collection
    """

    _metrics = {}

    def __init__(self, metrics_path=None):

        # Default to the collection folder, add to metrics cache if not there
        self.metrics_path = metrics_path or os.path.join(here, "collection")
        self.update()

    def update(self):
        """Add a new path to the metrics cache, if it doesn't exist"""
        self._metrics = self._find_metrics()

    def _find_metrics(self):
        """Find metrics based on listing folders under the metrics collection
        folder.
        """
        # Create a metric lookup dictionary
        metrics = {}
        for metric_name in os.listdir(self.metrics_path):
            metric_dir = os.path.join(self.metrics_path, metric_name)
            metric_file = os.path.join(metric_dir, "metric.py")

            # Skip files in collection folder
            if os.path.isfile(metric_dir):
                continue

            # Continue if the file doesn't exist
            if not os.path.exists(metric_file):
                logger.debug(
                    "%s does not appear to have a metric.py, skipping." % metric_dir
                )
                continue

            # The class name means we split by underscore, capitalize, and join
            class_name = "".join([x.capitalize() for x in metric_name.split("_")])
            metrics[metric_name] = "caliper.metrics.collection.%s.metric.%s" % (
                metric_name,
                class_name,
            )
        return metrics

    def __getitem__(self, name):
        return self._metrics.get(name)

    def __iter__(self):
        return iter(self._metrics)

    def __len__(self):
        return len(self._metrics)
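
The naming convention used by `_find_metrics` can be tried out on its own. The sketch below reproduces only the string handling (split on underscores, capitalize each part, join) without touching the filesystem. The function name `metric_class_path` and the folder names other than `changedlines` are hypothetical.

```python
def metric_class_path(metric_name, package="caliper.metrics.collection"):
    """Map a metric folder name to the dotted path of its expected class."""
    class_name = "".join(part.capitalize() for part in metric_name.split("_"))
    return "%s.%s.metric.%s" % (package, metric_name, class_name)


# The folder that exists in this commit
existing = metric_class_path("changedlines")

# A hypothetical folder name containing an underscore
hypothetical = metric_class_path("changed_files")

print(existing)
print(hypothetical)
```

Note that `str.capitalize` lowercases everything after the first character of each part, so a folder named `changedlines` must contain a class spelled `Changedlines`, not `ChangedLines`.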
Empty file.
100 changes: 100 additions & 0 deletions caliper/metrics/collection/changedlines/metric.py
@@ -0,0 +1,100 @@
__author__ = "Vanessa Sochat"
__copyright__ = "Copyright 2020-2021, Vanessa Sochat"
__license__ = "MPL 2.0"

from caliper.metrics.base import ChangeMetricBase
import os

import git as gitpython

DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"
EMPTY_SHA = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


class Changedlines(ChangeMetricBase):

    name = "changedlines"
    description = "count lines added and removed between versions"

    def __init__(self):
        self._data = {}

    @property
    def rawdata(self):
        return self._data

    def extract(self, git):
        """given a file before and after, count the number of changed lines"""
        repo = gitpython.Repo(git.folder)
        for tag in repo.tags:
            parent = tag.commit.parents[0] if tag.commit.parents else EMPTY_SHA

            # Derive the diff name
            tag2 = "EMPTY" if isinstance(parent, str) else parent.message.strip()
            index = "%s..%s" % (tag2, tag)

            # A ChangeMetric stores tag diffs
            self._data[index] = self._extract(git, tag.commit, parent)

    def _extract(self, git, commit1, commit2):
        """The second commit should be the parent"""
        diffs = {diff.a_path: diff for diff in commit1.diff(commit2)}
        data = []

        # The stats on the commit is a summary of all the changes for this
        # commit, we'll iterate through it to get the information we need.
        for filepath, stats in commit1.stats.files.items():

            # Select the diff for the path in the stats
            diff = diffs.get(filepath)

            # Was the path renamed?
            if not diff:
                for diff in diffs.values():
                    if diff.b_path == git.folder and diff.renamed:
                        break

            # Update the stats with the additional information
            stats.update(
                {
                    "object": os.path.join(git.folder, filepath),
                    "commit": commit1.hexsha,
                    "author": commit1.author.email,
                    "timestamp": commit1.authored_datetime.strftime(DATE_TIME_FORMAT),
                    "size": diff_size(diff),
                }
            )
            if stats:
                data.append(stats)

        return data

    def get_file_results(self):
        """return a lookup of changes, where each change has a list of files"""
        return self._data

    def get_summed_results(self):
        """Get summed values (e.g., lines changed) across files"""
        results = {}
        summary_keys = ["size", "insertions", "deletions", "lines"]
        for index, items in self._data.items():
            results[index] = dict((x, 0) for x in summary_keys)
            for item in items:
                for key in summary_keys:
                    results[index][key] += item.get(key, 0)
        return results


def diff_size(diff):
    """Calculate the size of the diff by comparing blob size
    Computes the size of the diff by comparing the size of the blobs.
    """
    # New file
    if not diff.a_blob and diff.new_file:
        return diff.b_blob.size

    # Deletion (should be negative)
    if not diff.b_blob and diff.deleted_file:
        return -1 * diff.a_blob.size

    return diff.a_blob.size - diff.b_blob.size
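
To see how the per-file records relate to the summed results, the following sketch applies the same aggregation as `get_summed_results` to plain dictionaries, so it runs without git or caliper. The function name `sum_results` is hypothetical and the per-file values are made up. Non-numeric fields such as `object` are ignored because only the summary keys are read.

```python
SUMMARY_KEYS = ["size", "insertions", "deletions", "lines"]

# Made-up per-file records in the shape stored by the change metric:
# one list of file entries per "previous..current" index.
file_results = {
    "0.1.0..0.2.0": [
        {"object": "a.py", "insertions": 3, "deletions": 1, "lines": 4, "size": 20},
        {"object": "b.py", "insertions": 1, "deletions": 3, "lines": 4, "size": -20},
    ]
}


def sum_results(file_results, keys=SUMMARY_KEYS):
    """Sum each summary key across the files of every version range."""
    results = {}
    for index, items in file_results.items():
        totals = {key: 0 for key in keys}
        for item in items:
            for key in keys:
                # Missing keys count as zero, as in get_summed_results
                totals[key] += item.get(key, 0)
        results[index] = totals
    return results


summed = sum_results(file_results)
print(summed)
```

Because sizes are signed, a growth of 20 bytes in one file and a shrinkage of 20 bytes in another cancel out, which is how a range can report lines changed while its summed size is 0.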
