Instrument libraries (distributions) rather than packages
To answer the questions we have, we need data on libraries
installed in the environment, not packages that are
imported. importlib_metadata gives us access to the
RECORD file (https://www.python.org/dev/peps/pep-0376/#record)
for every package, and we build a reverse mapping of
package name -> distribution once. Distribution names
(I am calling distributions 'libraries') are then used.
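A condensed sketch of that reverse mapping (the full version is
`get_all_packages` in the diff below); `dist.files` is importlib_metadata's
parsed view of each distribution's RECORD, and the `or []` guards
distributions that ship no RECORD:

```python
from importlib_metadata import distributions

def build_reverse_mapping():
    # Map each importable package/module name to the distribution(s)
    # whose RECORD lists its files
    mapping = {}
    for dist in distributions():
        for f in dist.files or []:
            if f.name == '__init__.py':
                # A directory containing __init__.py is an importable package
                mapping.setdefault(str(f.parent).replace('/', '.'), []).append(dist)
            elif f.name == str(f):
                # A top-level single-file module, e.g. six.py
                mapping.setdefault(str(f).replace('.py', ''), []).append(dist)
    return mapping
```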

Since distributions are by definition explicitly installed, they
already exclude modules in the standard library and any
local, user-written modules. The dependency on the stdlib_list
library can therefore be removed.

All metric names have been changed to talk about libraries,
not packages.

The word 'package' is hopelessly overloaded, and nobody knows anything
about 'distributions'. https://packaging.python.org/glossary/ is somewhat
helpful. I will now try to use just 'module' (something that can be
imported - since our source is sys.modules) and 'library' (what is
installed with pip or conda - aka a distribution).

Despite what python/importlib_metadata#131
says, the packages_distributions function in importlib_metadata relies
on the undocumented `top_level.txt` file, and does not work with
anything not built with setuptools. So we go through all the
RECORD files ourselves.
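For reference, a RECORD file is just a CSV listing every file a
distribution installed, which is what makes this reverse mapping possible
without `top_level.txt`. A made-up excerpt for a hypothetical six
install (hashes elided, sizes illustrative):

```
six.py,sha256=...,34549
six-1.16.0.dist-info/METADATA,sha256=...,1795
six-1.16.0.dist-info/RECORD,,
```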

Added some unit tests, and refactored some functions to make them
easier to test. Import-time side effects are definitely harder to
test, so I now require an explicit setup function call. This makes
testing much easier, and is also more intuitive.
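For instance, a test along these lines becomes possible (a hypothetical
sketch, not necessarily one of the tests added here), since the module
lists are passed in explicitly and `get_all_packages` can be patched out
with pytest-mock:

```python
from popularity_contest import reporter

def test_get_used_libraries(mocker):
    # Pretend 'numpy' is the only installed distribution
    fake_dist = mocker.Mock()
    fake_dist.name = 'numpy'
    mocker.patch.object(
        reporter, 'get_all_packages',
        return_value={'numpy': [fake_dist]},
    )

    # 'sys' was loaded before setup so it is ignored, and 'json'
    # maps to no installed distribution (it is stdlib)
    used = reporter.get_used_libraries(
        current_modules=['sys', 'json', 'numpy'],
        initial_modules=['sys'],
    )
    assert used == {'numpy'}
```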

Bump version number
yuvipanda committed Jul 10, 2021 · 1 parent e53579d · commit f3676d4
Showing 5 changed files with 228 additions and 64 deletions.
78 changes: 40 additions & 38 deletions README.md
@@ -1,24 +1,24 @@
 # python-popularity-contest
 
 In interactive computing installations, figuring out which python
-modules are in use is extremely helpful in managing environments
+libraries are in use is extremely helpful in managing environments
 for users.
 
-On import, this module will setup an `atexit` hook, which will
-send the list of imported modules to a statsd server for aggregation.
+python-popularity-contest collects pre-aggregated, anonymized data
+on which installed libraries are being actively used by your users.
 
 Named after the [debian popularity contest](https://popcon.debian.org/)
 
 ## What data is collected?
 
 We want to collect just enough data to help with the following tasks:
 
-1. Remove unused packages that have never been imported. These can
+1. Remove unused libraries that have never been imported. These can
    probably be removed without a lot of breakage for individual
    users
 
-2. Provide aggregate statistics about the 'popularity' of a package
-   to add a data point for understanding how important a particular package is
+2. Provide aggregate statistics about the 'popularity' of a library
+   to add a data point for understanding how important a particular library is
    to a group of users. This can help with funding requests, better
    training recommendations, etc.

@@ -27,14 +27,15 @@ source. Only overall global counts are stored, without any individual
 record of each source. This is much better than storing per-user or
 per-process records.
 
-The data we have will be a time series for each package, representing
-the cumulative count of processes where this package was imported. This
-functions as a [prometheus counter](https://prometheus.io/docs/concepts/metric_types/#counter),
-which is how eventually queries are written.
+The data we have will be a time series for each library, representing the
+cumulative count of processes where any module from this library was imported.
+This is designed as a [prometheus
+counter](https://prometheus.io/docs/concepts/metric_types/#counter), which is
+how queries will eventually be written.
 
 ## Collection infrastructure
 
-The package emits metrics over the [statsd](https://github.com/statsd/statsd)
+`popularity_contest` emits metrics over the [statsd](https://github.com/statsd/statsd)
 protocol, so you need a statsd server running to collect and aggregate
 this information. Since statsd only stores global aggregate counts, we
 never collect data beyond what we need.
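(An illustrative aside, not part of the diff: each report is a statsd
counter increment, which on the wire is a plain-text UDP packet like the
one below; the mapping rule that follows turns it into
`python_popcon_library_used{library="numpy"}`.)

```
python_popcon.library_used.numpy:1|c
```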
@@ -46,18 +47,18 @@ The recommended collection pipeline is:
 
 A [mapping rule](https://github.com/prometheus/statsd_exporter#glob-matching)
 to convert the statsd metrics into usable prometheus metrics, with
-helpful labels for package names. Instead of many metrics named like
-`python_popcon_imported_package_<package-name>`, we can get a better
-`python_popcon_imported_package{package="<package-name>"}`. A mapping
+helpful labels for library names. Instead of many metrics named like
+`python_popcon_library_used_<library-name>`, we can get a better
+`python_popcon_library_used{library="<library-name>"}`. A mapping
 rule that works with the default statsd metric name structure would
 look like:
 
 ```yaml
 mappings:
-  - match: "python_popcon.imported_package.*"
-    name: "python_popcon_imported_package"
+  - match: "python_popcon.library_used.*"
+    name: "python_popcon_library_used"
     labels:
-      package: "$1"
+      library: "$1"
 ```
 
 You can add additional labels here if you would like.
@@ -83,10 +84,10 @@ service:
   statsd:
     mappingConfig: |-
       mappings:
-        - match: "python_popcon.imported_package.*"
-          name: "python_popcon_imported_package"
+        - match: "python_popcon.library_used.*"
+          name: "python_popcon_library_used"
           labels:
-            package: "$1"
+            library: "$1"
 ```
 
 ## Installing
@@ -104,15 +105,16 @@ It must be installed in the environment we want instrumented.
 
 ### Activation
 
-Once installed, the `popularity_contest.reporter` module needs to
-be imported for reporting to be enabled. You can enable reporting
-for all IPython sessions (and hence Jupyter Notebook sessions)
-with an [IPython startup script](https://switowski.com/blog/ipython-startup-files).
+After installation, the popularity_contest reporter must be explicitly
+set up. You can enable reporting for all IPython sessions (and hence Jupyter
+Notebook sessions) with an [IPython startup
+script](https://switowski.com/blog/ipython-startup-files).
 
 The startup script needs just two lines:
 
 ```python
 import popularity_contest.reporter
+popularity_contest.reporter.setup_reporter()
 ```
 
 Since the instrumentation is usually set up by an admin and not
@@ -123,10 +125,10 @@ conda environment installed in `/opt/conda`, you can put the file in
 way, it also gets loaded before any user specific IPython startup
 scripts.
 
-Only packages imported *after* `popularity_contest.reporter`
-was imported will be counted. This reduces noise from baseline
-packages (like `IPython` or `six`) that are used invisibly by
-everyone.
+Only modules imported *after* the reporter is set up with
+`popularity_contest.reporter.setup_reporter()` will be counted. This reduces
+noise from baseline libraries (like `IPython` or `six`) that are used invisibly
+by everyone.
 
 ### Statsd server connection info

@@ -139,9 +141,9 @@ to be set.
    to. With the recommended `prometheus_statsd` setup, this will be
    `9125`.
 3. `PYTHON_POPCONTEST_STATSD_PREFIX` - the prefix each statsd metric
-   will have, defaults to `python_popcon.imported_package`. So
+   will have, defaults to `python_popcon.library_used`. So
    each metric in statsd will be of the form
-   `python_popcon.imported_package.<package-name>`.
+   `python_popcon.library_used.<library-name>`.
 
    You can put additional information in this prefix, and use that
    to extract more labels in prometheus. For example, in a
@@ -155,23 +157,23 @@ to be set.
 import os
 pod_namespace = os.environ['POD_NAMESPACE']
 c.KubeSpawner.environment.update({
-    'PYTHON_POPCONTEST_STATSD_PREFIX': f'python_popcon.namespace.{pod_namespace}.imported_package'
+    'PYTHON_POPCONTEST_STATSD_PREFIX': f'python_popcon.namespace.{pod_namespace}.library_used'
 })
 ```
 
 A mapping rule can be added to `prometheus_statsd` to extract the namespace.
 
 ```yaml
 mappings:
-  - match: "python_popcon.namespace.*.imported_package.*"
-    name: "python_popcon_imported_package"
+  - match: "python_popcon.namespace.*.library_used.*"
+    name: "python_popcon_library_used"
     labels:
       namespace: "$1"
-      package: "$2"
+      library: "$2"
 ```
 
 The prometheus metrics produced out of this will be of the form
-`python_popcon_imported_package{package="<package-name>", namespace="<namespace>"}`
+`python_popcon_library_used{library="<library-name>", namespace="<namespace>"}`

## Privacy

@@ -181,9 +183,9 @@ private information (like usernames tied to activity times, etc).
 
 However, side channel attacks are still possible if the entire
 set of timeseries data is available. Individual users might have specific
-patterns of packages they use, and this might be discernable with enough
-analysis. If some packages are uniquely used only by particular users,
+patterns of modules they use, and this might be discernible with enough
+analysis. If some libraries are uniquely used only by particular users,
 this analysis becomes easier. Further aggregation of the data, redaction
-of information about packages that don't have a lot of use, etc are methods
+of information about modules that don't have a lot of use, etc. are methods
 that can be used to further anonymize this dataset, based on your needs.

3 changes: 3 additions & 0 deletions dev-requirements.txt
@@ -0,0 +1,3 @@
+pytest
+pytest-cov
+pytest-mock
107 changes: 83 additions & 24 deletions popularity_contest/reporter.py
@@ -6,53 +6,112 @@
 for users.
 On import, this module will setup an `atexit` hook, which will
-send the list of imported modules to a statsd server for aggregation.
+send the list of distributions (libraries) from which modules
+have been imported. stdlib and local modules are ignored.
 """
 import sys
 import atexit
 import os
-from stdlib_list import stdlib_list
 from statsd import StatsClient
+from importlib_metadata import distributions

-# Make a copy of packages that have already been loaded
-# until this point. These will not be reported to statsd,
-# since these are 'infrastructure' packages that are needed
-# by everyone, regardless of the specifics of the code being
-# written.
-ORIGINALLY_LOADED_PACKAGES = list(sys.modules.keys())
+ORIGINALLY_LOADED_MODULES = []

+def setup_reporter(current_modules=None):
+    """
+    Initialize the reporter
+
+    Saves the list of currently loaded modules in a global
+    variable, so we can ignore the modules that were imported
+    before this method was called.
+    """
+    if current_modules is None:
+        current_modules = sys.modules
+    global ORIGINALLY_LOADED_MODULES
+    ORIGINALLY_LOADED_MODULES = list(current_modules.keys())
+
+    atexit.register(report_popularity)
+
+
+def get_all_packages():
+    """
+    List all installed packages with their distributions
+
+    Returns a dictionary, with the package name as the key
+    and the list of Distribution objects the package is
+    provided by.
+
+    Warning:
+    This makes a bunch of filesystem calls so can be expensive if you
+    have a lot of packages installed on a slow filesystem (like NFS).
+    """
+    packages = {}
+    for dist in distributions():
+        for f in dist.files:
+            if f.name == '__init__.py':
+                # If an __init__.py file is present, the parent
+                # directory should be counted as a package
+                package = str(f.parent).replace('/', '.')
+                packages.setdefault(package, []).append(dist)
+            elif f.name == str(f):
+                # If it is a top level file, it should be
+                # considered as a package by itself
+                package = str(f).replace('.py', '')
+                packages.setdefault(package, []).append(dist)
+    return packages


+def get_used_libraries(current_modules, initial_modules):
+    """
+    Return list of libraries with modules that were imported.
+
+    Finds the modules present in current_modules but not in
+    initial_modules, and gets the libraries that provide these
+    modules.
+    """
+    all_packages = get_all_packages()
+
+    libraries = set()
+
+    for module_name in current_modules:
+        if module_name in initial_modules:
+            # Ignore modules that were already loaded when we were imported
+            continue
+
+        # Only look for packages from distributions explicitly
+        # installed in the environment. No stdlib, no local installs.
+        if module_name in all_packages:
+            for p in all_packages[module_name]:
+                libraries.add(p.name)
+
+    return libraries
+
+
-def report_popularity():
+def report_popularity(current_modules=None):
     """
     Report imported packages to statsd
-    This runs just before a process exits, so must be very fast.
+
+    This runs just before a process exits, so must be as fast as
+    possible.
     """
+    if current_modules is None:
+        current_modules = sys.modules
     statsd = StatsClient(
         host=os.environ.get('PYTHON_POPCONTEST_STATSD_HOST', 'localhost'),
         port=int(os.environ.get('PYTHON_POPCONTEST_STATSD_PORT', 8125)),
-        prefix=os.environ.get('PYTHON_POPCONTEST_STATSD_PREFIX', 'python_popcon.imported_package')
+        prefix=os.environ.get('PYTHON_POPCONTEST_STATSD_PREFIX', 'python_popcon')
     )

-    packages = set()
-    for name in sys.modules:
-        if name in ORIGINALLY_LOADED_PACKAGES:
-            # Ignore packages that were already loaded when we were imported
-            continue
-        if name in stdlib_list():
-            # Ignore packages in stdlib
-            continue
-        if name[0] == '_':
-            # Ignore packages starting with `_`
-            continue
-        packages.add(name.split('.')[0])
+    libraries = get_used_libraries(current_modules, ORIGINALLY_LOADED_MODULES)

     # Use a statsd pipeline to reduce total network usage
     with statsd.pipeline() as stats_pipe:
-        for p in packages:
-            stats_pipe.incr(p, 1)
+        for p in libraries:
+            stats_pipe.incr(f'library_used.{p}', 1)
         stats_pipe.send()
 
-
-atexit.register(report_popularity)
+    statsd.incr('reports', 1)
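(An illustrative aside, not part of the diff - the end-to-end flow after
this change, assuming a statsd server is reachable:)

```python
import popularity_contest.reporter
popularity_contest.reporter.setup_reporter()  # snapshot sys.modules, register atexit hook

import numpy  # imported after setup, so its distribution is counted

# At interpreter exit, report_popularity() maps newly imported modules
# back to distributions via RECORD and sends 'library_used.numpy' plus
# a 'reports' counter to the configured statsd server.
```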
4 changes: 2 additions & 2 deletions setup.py
@@ -7,7 +7,7 @@
     author="Yuvi Panda",
     packages=setuptools.find_packages(),
     install_requires=[
-        'stdlib-list',
-        'statsd'
+        'statsd',
+        'importlib_metadata'
     ]
 )