Instrument libraries (distributions) rather than packages
To answer the questions we have, we need data on libraries
installed in the environment, not packages that are
imported. importlib_metadata gives us access to the
RECORD file (https://www.python.org/dev/peps/pep-0376/#record)
for every package, and we build a reverse mapping of
package name -> distribution once. Distribution names
(I am calling distributions 'libraries') are then used.
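A condensed sketch of that reverse mapping (the full version is
`get_all_packages` in the diff below); `dist.files` is importlib_metadata's
parsed view of each distribution's RECORD, and the `or []` guards
distributions that ship no RECORD:

```python
from importlib_metadata import distributions

def build_reverse_mapping():
    # Map each importable package/module name to the distribution(s)
    # whose RECORD lists its files
    mapping = {}
    for dist in distributions():
        for f in dist.files or []:
            if f.name == '__init__.py':
                # A directory containing __init__.py is an importable package
                mapping.setdefault(str(f.parent).replace('/', '.'), []).append(dist)
            elif f.name == str(f):
                # A top-level single-file module, e.g. six.py
                mapping.setdefault(str(f).replace('.py', ''), []).append(dist)
    return mapping
```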

Since distributions are by definition explicitly installed, they
already exclude modules in the standard library and any
local, user-written modules. The dependency on the stdlib_list
library can therefore be removed.

All metric names have been changed to talk about libraries,
not packages.

The word 'package' is hopelessly overloaded, and nobody knows anything
about 'distributions'. https://packaging.python.org/glossary/ is somewhat
helpful. I will now try to use just 'module' (something that can be
imported - since our source is sys.modules) and 'library' (what is
installed with pip or conda - aka a distribution).

Despite what python/importlib_metadata#131
says, the packages_distributions function in importlib_metadata relies
on the undocumented `top_level.txt` file, and does not work with
anything not built with setuptools. So we go through all the
RECORD files ourselves.
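For reference, a RECORD file is just a CSV listing every file a
distribution installed, which is what makes this reverse mapping possible
without `top_level.txt`. A made-up excerpt for a hypothetical six
install (hashes elided, sizes illustrative):

```
six.py,sha256=...,34549
six-1.16.0.dist-info/METADATA,sha256=...,1795
six-1.16.0.dist-info/RECORD,,
```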

Added some unit tests, and refactored some functions to make them
easier to test. Import-time side effects are definitely harder to
test, so I now require an explicit setup function call. This makes
testing much easier, and is also more intuitive.
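For instance, a test along these lines becomes possible (a hypothetical
sketch, not necessarily one of the tests added here), since the module
lists are passed in explicitly and `get_all_packages` can be patched out
with pytest-mock:

```python
from popularity_contest import reporter

def test_get_used_libraries(mocker):
    # Pretend 'numpy' is the only installed distribution
    fake_dist = mocker.Mock()
    fake_dist.name = 'numpy'
    mocker.patch.object(
        reporter, 'get_all_packages',
        return_value={'numpy': [fake_dist]},
    )

    # 'sys' was loaded before setup so it is ignored, and 'json'
    # maps to no installed distribution (it is stdlib)
    used = reporter.get_used_libraries(
        current_modules=['sys', 'json', 'numpy'],
        initial_modules=['sys'],
    )
    assert used == {'numpy'}
```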

Bump version number
yuvipanda committed Jul 10, 2021 · 1 parent e53579d · commit f3676d4
Showing 5 changed files with 228 additions and 64 deletions.
78 changes: 40 additions & 38 deletions README.md
@@ -1,24 +1,24 @@
 # python-popularity-contest
 
 In interactive computing installations, figuring out which python
-modules are in use is extremely helpful in managing environments
+libraries are in use is extremely helpful in managing environments
 for users.
 
-On import, this module will setup an `atexit` hook, which will
-send the list of imported modules to a statsd server for aggregation.
+python-popularity-contest collects pre-aggregated, anonymized data
+on which installed libraries are being actively used by your users.
 
 Named after the [debian popularity contest](https://popcon.debian.org/)
 
 ## What data is collected?
 
 We want to collect just enough data to help with the following tasks:
 
-1. Remove unused packages that have never been imported. These can
+1. Remove unused libraries that have never been imported. These can
    probably be removed without a lot of breakage for individual
    users
 
-2. Provide aggregate statistics about the 'popularity' of a package
-   to add a data point for understanding how important a particular package is
+2. Provide aggregate statistics about the 'popularity' of a library
+   to add a data point for understanding how important a particular library is
    to a group of users. This can help with funding requests, better
    training recommendations, etc.

@@ -27,14 +27,15 @@ source. Only overall global counts are stored, without any individual
 record of each source. This is much better than storing per-user or
 per-process records.
 
-The data we have will be a time series for each package, representing
-the cumulative count of processes where this package was imported. This
-functions as a [prometheus counter](https://prometheus.io/docs/concepts/metric_types/#counter),
-which is how eventually queries are written.
+The data we have will be a time series for each library, representing the
+cumulative count of processes where any module from this library was imported.
+This is designed as a [prometheus
+counter](https://prometheus.io/docs/concepts/metric_types/#counter), which is
+how queries will eventually be written.
 
 ## Collection infrastructure
 
-The package emits metrics over the [statsd](https://github.com/statsd/statsd)
+`popularity_contest` emits metrics over the [statsd](https://github.com/statsd/statsd)
 protocol, so you need a statsd server running to collect and aggregate
 this information. Since statsd only stores global aggregate counts, we
 never collect data beyond what we need.
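(An illustrative aside, not part of the diff: each report is a statsd
counter increment, which on the wire is a plain-text UDP packet like the
one below; the mapping rule that follows turns it into
`python_popcon_library_used{library="numpy"}`.)

```
python_popcon.library_used.numpy:1|c
```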
@@ -46,18 +47,18 @@ The recommended collection pipeline is:
 
 A [mapping rule](https://github.com/prometheus/statsd_exporter#glob-matching)
 to convert the statsd metrics into usable prometheus metrics, with
-helpful labels for package names. Instead of many metrics named like
-`python_popcon_imported_package_<package-name>`, we can get a better
-`python_popcon_imported_package{package="<package-name>"}`. A mapping
+helpful labels for library names. Instead of many metrics named like
+`python_popcon_library_used_<library-name>`, we can get a better
+`python_popcon_library_used{library="<library-name>"}`. A mapping
 rule that works with the default statsd metric name structure would
 look like:
 
 ```yaml
 mappings:
-  - match: "python_popcon.imported_package.*"
-    name: "python_popcon_imported_package"
+  - match: "python_popcon.library_used.*"
+    name: "python_popcon_library_used"
     labels:
-      package: "$1"
+      library: "$1"
 ```
 
 You can add additional labels here if you would like.
@@ -83,10 +84,10 @@ service:
   statsd:
     mappingConfig: |-
       mappings:
-        - match: "python_popcon.imported_package.*"
-          name: "python_popcon_imported_package"
+        - match: "python_popcon.library_used.*"
+          name: "python_popcon_library_used"
           labels:
-            package: "$1"
+            library: "$1"
 ```
 
 ## Installing
@@ -104,15 +105,16 @@ It must be installed in the environment we want instrumented.
 
 ### Activation
 
-Once installed, the `popularity_contest.reporter` module needs to
-be imported for reporting to be enabled. You can enable reporting
-for all IPython sessions (and hence Jupyter Notebook sessions)
-with an [IPython startup script](https://switowski.com/blog/ipython-startup-files).
+After installation, the popularity_contest reporter must be explicitly
+set up. You can enable reporting for all IPython sessions (and hence Jupyter
+Notebook sessions) with an [IPython startup
+script](https://switowski.com/blog/ipython-startup-files).
 
 The startup script needs just two lines:
 
 ```python
 import popularity_contest.reporter
+popularity_contest.reporter.setup_reporter()
 ```
 
 Since the instrumentation is usually set up by an admin and not
@@ -123,10 +125,10 @@ conda environment installed in `/opt/conda`, you can put the file in
 way, it also gets loaded before any user specific IPython startup
 scripts.
 
-Only packages imported *after* `popularity_contest.reporter`
-was imported will be counted. This reduces noise from baseline
-packages (like `IPython` or `six`) that are used invisibly by
-everyone.
+Only modules imported *after* the reporter is set up with
+`popularity_contest.reporter.setup_reporter()` will be counted. This reduces
+noise from baseline libraries (like `IPython` or `six`) that are used invisibly
+by everyone.
 
 ### Statsd server connection info

@@ -139,9 +141,9 @@ to be set.
    to. With the recommended `prometheus_statsd` setup, this will be
    `9125`.
 3. `PYTHON_POPCONTEST_STATSD_PREFIX` - the prefix each statsd metric
-   will have, defaults to `python_popcon.imported_package`. So
+   will have, defaults to `python_popcon.library_used`. So
    each metric in statsd will be of the form
-   `python_popcon.imported_package.<package-name>`.
+   `python_popcon.library_used.<library-name>`.
 
    You can put additional information in this prefix, and use that
    to extract more labels in prometheus. For example, in a
@@ -155,23 +157,23 @@ to be set.
 import os
 pod_namespace = os.environ['POD_NAMESPACE']
 c.KubeSpawner.environment.update({
-    'PYTHON_POPCONTEST_STATSD_PREFIX': f'python_popcon.namespace.{pod_namespace}.imported_package'
+    'PYTHON_POPCONTEST_STATSD_PREFIX': f'python_popcon.namespace.{pod_namespace}.library_used'
 })
 ```
 
 A mapping rule can be added to `prometheus_statsd` to extract the namespace.
 
 ```yaml
 mappings:
-  - match: "python_popcon.namespace.*.imported_package.*"
-    name: "python_popcon_imported_package"
+  - match: "python_popcon.namespace.*.library_used.*"
+    name: "python_popcon_library_used"
     labels:
       namespace: "$1"
-      package: "$2"
+      library: "$2"
 ```
 
 The prometheus metrics produced out of this will be of the form
-`python_popcon_imported_package{package="<package-name>", namespace="<namespace>"}`
+`python_popcon_library_used{library="<library-name>", namespace="<namespace>"}`

## Privacy

@@ -181,9 +183,9 @@ private information (like usernames tied to activity times, etc).
 
 However, side channel attacks are still possible if the entire
 set of timeseries data is available. Individual users might have specific
-patterns of packages they use, and this might be discernable with enough
-analysis. If some packages are uniquely used only by particular users,
+patterns of modules they use, and this might be discernible with enough
+analysis. If some libraries are uniquely used only by particular users,
 this analysis becomes easier. Further aggregation of the data, redaction
-of information about packages that don't have a lot of use, etc are methods
+of information about modules that don't have a lot of use, etc. are methods
 that can be used to further anonymize this dataset, based on your needs.

3 changes: 3 additions & 0 deletions dev-requirements.txt
@@ -0,0 +1,3 @@
+pytest
+pytest-cov
+pytest-mock
107 changes: 83 additions & 24 deletions popularity_contest/reporter.py
@@ -6,53 +6,112 @@
 for users.
 On import, this module will setup an `atexit` hook, which will
-send the list of imported modules to a statsd server for aggregation.
+send the list of distributions (libraries) from which modules
+have been imported. stdlib and local modules are ignored.
 """
 import sys
 import atexit
 import os
-from stdlib_list import stdlib_list
 from statsd import StatsClient
+from importlib_metadata import distributions

-# Make a copy of packages that have already been loaded
-# until this point. These will not be reported to statsd,
-# since these are 'infrastructure' packages that are needed
-# by everyone, regardless of the specifics of the code being
-# written.
-ORIGINALLY_LOADED_PACKAGES = list(sys.modules.keys())
+ORIGINALLY_LOADED_MODULES = []

+def setup_reporter(current_modules=None):
+    """
+    Initialize the reporter
+
+    Saves the list of currently loaded modules in a global
+    variable, so we can ignore the modules that were imported
+    before this method was called.
+    """
+    if current_modules is None:
+        current_modules = sys.modules
+    global ORIGINALLY_LOADED_MODULES
+    ORIGINALLY_LOADED_MODULES = list(current_modules.keys())
+
+    atexit.register(report_popularity)
+
+
+def get_all_packages():
+    """
+    List all installed packages with their distributions
+
+    Returns a dictionary, with the package name as the key
+    and the list of Distribution objects the package is
+    provided by.
+
+    Warning:
+    This makes a bunch of filesystem calls so can be expensive if you
+    have a lot of packages installed on a slow filesystem (like NFS).
+    """
+    packages = {}
+    for dist in distributions():
+        for f in dist.files:
+            if f.name == '__init__.py':
+                # If an __init__.py file is present, the parent
+                # directory should be counted as a package
+                package = str(f.parent).replace('/', '.')
+                packages.setdefault(package, []).append(dist)
+            elif f.name == str(f):
+                # If it is a top level file, it should be
+                # considered as a package by itself
+                package = str(f).replace('.py', '')
+                packages.setdefault(package, []).append(dist)
+    return packages


+def get_used_libraries(current_modules, initial_modules):
+    """
+    Return list of libraries with modules that were imported.
+
+    Finds the modules present in current_modules but not in
+    initial_modules, and gets the libraries that provide these
+    modules.
+    """
+    all_packages = get_all_packages()
+
+    libraries = set()
+
+    for module_name in current_modules:
+        if module_name in initial_modules:
+            # Ignore modules that were already loaded when we were imported
+            continue
+
+        # Only look for packages from distributions explicitly
+        # installed in the environment. No stdlib, no local installs.
+        if module_name in all_packages:
+            for p in all_packages[module_name]:
+                libraries.add(p.name)
+
+    return libraries
+
+
-def report_popularity():
+def report_popularity(current_modules=None):
     """
     Report imported packages to statsd
-    This runs just before a process exits, so must be very fast.
+
+    This runs just before a process exits, so must be as fast as
+    possible.
     """
+    if current_modules is None:
+        current_modules = sys.modules
     statsd = StatsClient(
         host=os.environ.get('PYTHON_POPCONTEST_STATSD_HOST', 'localhost'),
         port=int(os.environ.get('PYTHON_POPCONTEST_STATSD_PORT', 8125)),
-        prefix=os.environ.get('PYTHON_POPCONTEST_STATSD_PREFIX', 'python_popcon.imported_package')
+        prefix=os.environ.get('PYTHON_POPCONTEST_STATSD_PREFIX', 'python_popcon')
     )

-    packages = set()
-    for name in sys.modules:
-        if name in ORIGINALLY_LOADED_PACKAGES:
-            # Ignore packages that were already loaded when we were imported
-            continue
-        if name in stdlib_list():
-            # Ignore packages in stdlib
-            continue
-        if name[0] == '_':
-            # Ignore packages starting with `_`
-            continue
-        packages.add(name.split('.')[0])
+    libraries = get_used_libraries(current_modules, ORIGINALLY_LOADED_MODULES)

     # Use a statsd pipeline to reduce total network usage
     with statsd.pipeline() as stats_pipe:
-        for p in packages:
-            stats_pipe.incr(p, 1)
+        for p in libraries:
+            stats_pipe.incr(f'library_used.{p}', 1)
         stats_pipe.send()
 
-
-atexit.register(report_popularity)
+    statsd.incr('reports', 1)
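(An illustrative aside, not part of the diff - the end-to-end flow after
this change, assuming a statsd server is reachable:)

```python
import popularity_contest.reporter
popularity_contest.reporter.setup_reporter()  # snapshot sys.modules, register atexit hook

import numpy  # imported after setup, so its distribution is counted

# At interpreter exit, report_popularity() maps newly imported modules
# back to distributions via RECORD and sends 'library_used.numpy' plus
# a 'reports' counter to the configured statsd server.
```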
4 changes: 2 additions & 2 deletions setup.py
@@ -7,7 +7,7 @@
     author="Yuvi Panda",
     packages=setuptools.find_packages(),
     install_requires=[
-        'stdlib-list',
-        'statsd'
+        'statsd',
+        'importlib_metadata'
     ]
 )