# Initial recommendation engine design (PoC)

In [1]:
import typing
import logging
from enum import Enum
from enum import auto

_LOGGER = logging.getLogger(__name__)
logging.basicConfig()
_LOGGER.setLevel(logging.DEBUG)

In [2]:
from thoth.lab import packages_info
# Show working environment to have this reproducible.
packages_info()

| | importable | package | version |
|---|---|---|---|
| 0 | True | thoth.adviser | 0.0.2 |
| 1 | True | thoth.analyzer | 0.0.5 |
| 2 | True | thoth.common | 0.0.3 |
| 3 | True | thoth.lab | 0.0.3 |
| 4 | True | thoth.package_extract | 1.0.0 |
| 5 | True | thoth.solver | 1.0.2 |
| 6 | True | thoth.storages | 0.0.29 |


We will use utilities already present in Thoth's code base, as well as the internal API of `pip`. Note that you need `pip<10` installed, because the internal API changed with release 10.

In [3]:
from thoth.solver import pip_compile
from pip._vendor.packaging.requirements import Requirement
from pip._vendor.packaging.specifiers import SpecifierSet

We will provide three basic recommendations:

  1. **STABLE** - based on the knowledge we have, the given software stack is known to work in the given environment
  2. **TESTING** - exclude packages that are known to cause errors, but keep packages for which we have no information (so their behavior in a software stack can be tested)
  3. **LATEST** - always a bleeding-edge software stack

In [4]:
class RecommendationType(Enum):
    STABLE = auto()
    TESTING = auto()
    LATEST = auto()

In the examples in this notebook we assume that we request `flask` and `tensorflow` in our software stack. This input can come directly from `requirements.txt`, so it is fine to include version specifiers.

In [5]:
raw_requirements = """
tensorflow
flask>=1.0
"""

We leave the resolution logic to `pip-compile`, which resolves the given software stack and provides a fully pinned-down specification of all Python packages that are direct or transitive dependencies of our requirements.

In [6]:
print(pip_compile(*raw_requirements.splitlines()))

#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --output-file requirements.txt requirements.in
#
absl-py==0.2.2            # via tensorflow
astor==0.6.2              # via tensorflow
bleach==1.5.0             # via tensorboard
click==6.7                # via flask
flask==1.0.2
gast==0.2.0               # via tensorflow
grpcio==1.12.0            # via tensorflow
html5lib==0.9999999       # via bleach, tensorboard
itsdangerous==0.24        # via flask
jinja2==2.10              # via flask
markdown==2.6.11          # via tensorboard
markupsafe==1.0           # via jinja2
numpy==1.14.3             # via tensorboard, tensorflow
protobuf==3.5.2.post1     # via tensorboard, tensorflow
six==1.11.0               # via absl-py, bleach, grpcio, html5lib, protobuf, tensorboard, tensorflow
tensorboard==1.8.0        # via tensorflow
tensorflow==1.8.0
termcolor==1.1.0          # via tensorflow
werkzeug==0.14.1          # via flask, tensorboard
wheel==0.31.1         

We will directly reuse the logic offered by pip to correctly parse packages from their textual representation, including version ranges.

In [7]:
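def parse_requirements(requirements: typing.Union[str, typing.Iterable[str]]) -> typing.List[Requirement]:
    # Accept raw requirements text or an iterable of lines; skip blank lines and comments.
    lines = requirements.splitlines() if isinstance(requirements, str) else requirements
    return [Requirement(req) for req in lines if req.strip() and not req.strip().startswith('#')]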

For demonstration purposes, let's parse our initial software stack requirements:

In [8]:
parsed_requirements = parse_requirements(raw_requirements)
parsed_requirements

[<Requirement('tensorflow')>, <Requirement('flask>=1.0')>]

Let's assume we have a knowledge base that stores information about a package in a specific version (package-version level information). In this example the knowledge base states whether the given package is good (`True`) or bad (`False`, meaning errors such as an installation failure in the requested environment). A missing record, or a record set to `None`, means we have no observations for the given package version that could be used for recommendations. This is especially useful for the `TESTING` recommendation type, in which we add such packages to our software stack (e.g. for testing purposes, or when no stable version exists for the given software stack).

In [9]:
KNOWLEDGE_BASE = {
    'absl-py==0.2.2': True,
    'astor==0.6.2': True,
    'bleach==1.5.0': True,
    'click==6.7': True,
    'flask==1.0.2': True,
    'gast==0.2.0': True,
    'grpcio==1.12.0': True,
    'html5lib==0.9999999': True,
    'itsdangerous==0.24': True,
    'jinja2==2.10': True,
    'markdown==2.6.11': True,
    'markupsafe==1.0': True,
    'numpy==1.14.3': True,
    'protobuf==3.5.2.post1': True,
    'six==1.11.0': True,
    'tensorboard==1.8.0': False,
    'tensorboard==1.7.0': True,
    'tensorflow==1.7.0': True,
    'tensorflow==1.7.1': None,
    'termcolor==1.1.0': True,
    'werkzeug==0.14.1': True,
    'wheel==0.31.1': True,
}
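
How an observation is interpreted depends on the recommendation type. The sketch below restates that policy as a small function. The `Action` enum and the `decide` helper are hypothetical names used only for illustration, and `RecommendationType` is redefined so that the snippet runs on its own.

```python
from enum import Enum, auto
from typing import Optional


class RecommendationType(Enum):
    STABLE = auto()
    TESTING = auto()
    LATEST = auto()


class Action(Enum):
    # Hypothetical outcome labels, not part of the notebook's code.
    KEEP = auto()
    KEEP_WITH_WARNING = auto()
    EXCLUDE = auto()


def decide(observation: Optional[bool], recommendation_type: RecommendationType) -> Action:
    """Map a knowledge-base observation to an action for one pinned package."""
    if recommendation_type is RecommendationType.LATEST:
        # LATEST does not consult observations at all.
        return Action.KEEP
    if observation is False:
        # Negative observations exclude the package for STABLE and TESTING.
        return Action.EXCLUDE
    if observation is None:
        # No observations: tolerated for TESTING, excluded for STABLE.
        if recommendation_type is RecommendationType.TESTING:
            return Action.KEEP_WITH_WARNING
        return Action.EXCLUDE
    return Action.KEEP


print(decide(None, RecommendationType.TESTING))  # Action.KEEP_WITH_WARNING
print(decide(None, RecommendationType.STABLE))   # Action.EXCLUDE
```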

As we use pip's internal requirement and specifier abstractions, let's create a wrapper around `pip-compile` that prepares its input and parses its output, so that we keep parsed requirements as Python objects.

In [10]:
from itertools import chain


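def _get_from_dependencies(comment: str) -> typing.List[str]:
    # pip-compile annotates transitive dependencies as "# via pkg1, pkg2".
    comment = comment.strip()
    if comment.startswith('via '):
        comment = comment[len('via '):]

    return [dep.strip() for dep in comment.split(',')]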

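def execute_pip_compile(*requirements: Requirement) -> typing.Tuple[typing.List[Requirement], typing.Dict[str, typing.List[str]]]:
    result = []
    # Maps a package name to the packages that introduced it (empty list for direct requirements).
    graph = {}
    
    output = pip_compile(*[str(req) for req in requirements])
    for line in output.splitlines():
        if not line.strip() or line.startswith('#'):
            # Skip blank lines and leading pip-compile comments.
            continue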
        line = line.split('#', maxsplit=1)
        
        if len(line) == 2:
            requirement, comment = line
            requirement = Requirement(requirement)
            result.append(requirement)
            
            from_dependencies = _get_from_dependencies(comment)
            graph[requirement.name] = from_dependencies
        else:
            requirement = Requirement(line[0])
            result.append(requirement)
            # This is root node.
            graph[requirement.name] = []
        
    return result, graph
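
To see what the wrapper extracts without invoking `pip-compile`, the following sketch applies the same line-by-line parsing to a hard-coded excerpt of the output shown earlier. It works on plain strings instead of `Requirement` objects; `parse_compiled_output`, `SAMPLE_OUTPUT`, `sample_pins` and `sample_graph` are illustrative names, not part of Thoth.

```python
import typing

# Excerpt of the pip-compile output shown earlier in this notebook.
SAMPLE_OUTPUT = """\
#
# This file is autogenerated by pip-compile
#
click==6.7                # via flask
flask==1.0.2
tensorboard==1.8.0        # via tensorflow
tensorflow==1.8.0
werkzeug==0.14.1          # via flask, tensorboard
"""


def parse_compiled_output(output: str) -> typing.Tuple[typing.List[str], typing.Dict[str, typing.List[str]]]:
    """Return pinned requirement strings and a package -> introducers mapping."""
    pins = []
    graph = {}
    for line in output.splitlines():
        if not line.strip() or line.startswith('#'):
            # Skip blank lines and the header comments.
            continue
        pin, _, comment = line.partition('#')
        pin = pin.strip()
        name = pin.split('==', maxsplit=1)[0]
        comment = comment.strip()
        if comment.startswith('via '):
            comment = comment[len('via '):]
        # Lines without a "via" comment end up with an empty list (direct requirement).
        graph[name] = [dep.strip() for dep in comment.split(',') if dep.strip()]
        pins.append(pin)
    return pins, graph


sample_pins, sample_graph = parse_compiled_output(SAMPLE_OUTPUT)
print(sample_graph['werkzeug'])  # ['flask', 'tensorboard']
print(sample_graph['flask'])     # []
```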

Let's perform `pip-compile` on our initial software stack requirements that are already parsed into Python objects:

In [11]:
requirements, graph = execute_pip_compile(*parsed_requirements)

requirements, graph

([<Requirement('absl-py==0.2.2')>,
  <Requirement('astor==0.6.2')>,
  <Requirement('bleach==1.5.0')>,
  <Requirement('click==6.7')>,
  <Requirement('flask==1.0.2')>,
  <Requirement('gast==0.2.0')>,
  <Requirement('grpcio==1.12.0')>,
  <Requirement('html5lib==0.9999999')>,
  <Requirement('itsdangerous==0.24')>,
  <Requirement('jinja2==2.10')>,
  <Requirement('markdown==2.6.11')>,
  <Requirement('markupsafe==1.0')>,
  <Requirement('numpy==1.14.3')>,
  <Requirement('protobuf==3.5.2.post1')>,
  <Requirement('six==1.11.0')>,
  <Requirement('tensorboard==1.8.0')>,
  <Requirement('tensorflow==1.8.0')>,
  <Requirement('termcolor==1.1.0')>,
  <Requirement('werkzeug==0.14.1')>,
  <Requirement('wheel==0.31.1')>],
 {'absl-py': ['tensorflow'],
  'astor': ['tensorflow'],
  'bleach': ['tensorboard'],
  'click': ['flask'],
  'flask': [],
  'gast': ['tensorflow'],
  'grpcio': ['tensorflow'],
  'html5lib': ['bleach', 'tensorboard'],
  'itsdangerous': ['flask'],
  'jinja2': ['flask'],
  'markdown': ['ten

Now that we have a dependency graph serialized into a dictionary, we can ask which packages introduced a given package as a dependency:

In [12]:
from collections import deque


def find_roots(graph, package_name):
    assert package_name in graph, f"The requested package {package_name} does not occur in the dependency graph {graph}"
    
    result = deque()
    queue = deque([(package_name, [])])
    while queue:
        package_name, traversed = queue.pop()
        ancestors = graph.get(package_name)
        
        if not ancestors:
            if traversed:
                result.append(traversed)
            continue
        
        for ancestor in ancestors:
            queue.append((ancestor, traversed + [ancestor]))

    return list(result)

In [13]:
# six is pulled in by tensorflow through several different dependency paths

find_roots(graph, 'six')

[['tensorflow'],
 ['tensorboard', 'tensorflow'],
 ['protobuf', 'tensorflow'],
 ['protobuf', 'tensorboard', 'tensorflow'],
 ['html5lib', 'tensorboard', 'tensorflow'],
 ['html5lib', 'bleach', 'tensorboard', 'tensorflow'],
 ['grpcio', 'tensorflow'],
 ['bleach', 'tensorboard', 'tensorflow'],
 ['absl-py', 'tensorflow']]
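
When only the top-level requirements responsible for a package matter, the paths can be collapsed into a set of roots. A minimal sketch, assuming the same graph layout as above (a package name mapped to the packages that introduced it); `introduced_by` and `toy_graph` are hypothetical and not part of the notebook's code:

```python
import typing


def introduced_by(graph: typing.Dict[str, typing.List[str]], package_name: str) -> typing.Set[str]:
    """Return the direct requirements that (transitively) pull in package_name."""
    roots = set()
    seen = set()
    stack = [package_name]
    while stack:
        current = stack.pop()
        if current in seen:
            # A package reachable through several paths is expanded only once.
            continue
        seen.add(current)
        ancestors = graph[current]
        if not ancestors:
            # No introducer recorded: this is a direct requirement.
            roots.add(current)
        stack.extend(ancestors)
    return roots


toy_graph = {
    'flask': [],
    'tensorflow': [],
    'tensorboard': ['tensorflow'],
    'werkzeug': ['flask', 'tensorboard'],
}
print(sorted(introduced_by(toy_graph, 'werkzeug')))  # ['flask', 'tensorflow']
```

In this sketch a direct requirement is reported as its own root, so `introduced_by(toy_graph, 'flask')` yields `{'flask'}`.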

In [15]:
parsed_requirements

[<Requirement('tensorflow')>, <Requirement('flask>=1.0')>]

In the initial recommendation function we check packages against our knowledge base and, based on the recommendation type, we either allow the resolved package in the final software stack or exclude it:

In [16]:
from collections import deque
from copy import copy
from thoth.solver.exceptions import ThothPipCompileError


def _get_version(package_name, pinned_requirements):
    for requirement in pinned_requirements:
        if requirement.name == package_name:
            return str(requirement).split('==', maxsplit=1)[1]

    raise ValueError(f"Package {package_name!r} not found in pinned requirements")

def exclude_requirement(requirement: Requirement,
                        requirements: typing.List[Requirement],
                        pinned_requirements: typing.List[Requirement],
                        dependency_graph: dict) -> typing.List[Requirement]:
    candidates = []

    requirement_version = str(requirement).split('==', maxsplit=1)[1]
    new_requirements = list(requirements)
    new_requirements.append(Requirement(f"{requirement.name}!={requirement_version}"))
    candidates.append(new_requirements)
    
    # Also all transitive requirements.
    packages = find_roots(dependency_graph, requirement.name)
    for package in set(chain(*packages)):
        package_version = _get_version(package, pinned_requirements)
        new_requirements = list(requirements)
        new_requirements.append(Requirement(f"{package}!={package_version}"))
        candidates.append(new_requirements)

    return candidates
    
    

def recommend(requirements: typing.List[str], recommendation_type: RecommendationType = RecommendationType.TESTING) -> typing.Dict[str, typing.Any]:
    info = {}
    requirements = parse_requirements(requirements)

    if recommendation_type == RecommendationType.LATEST:
        # Early stop for LATEST
        return {'stacks': [requirements], 'info': "Warning: observations were not considered when LATEST is used"}

    stacks = []
    queue = deque([requirements])
    while queue:
        requirements = queue.pop()

        _LOGGER.info(f"New resolution run for requirements: {[str(req) for req in requirements]}")

        try:
            pinned_requirements, dependency_graph = execute_pip_compile(*requirements)
        except ThothPipCompileError as exc:
            _LOGGER.warning(f"Requirement specification was invalid: {[str(req) for req in requirements]}: {str(exc)}")
            continue

        for requirement in pinned_requirements:
            is_ok = KNOWLEDGE_BASE.get(str(requirement))

            if is_ok is None and recommendation_type == RecommendationType.TESTING:
                info[str(requirement)] = "Warning: No observations found"
            elif (is_ok is None and recommendation_type == RecommendationType.STABLE) or is_ok is False:
                justification = "Package excluded - negative observations found in the knowledge database" if is_ok is False else "Package excluded - no observations found in the knowledge database"
                info[str(requirement)] = justification
                candidates = exclude_requirement(
                    requirement,
                    requirements,
                    pinned_requirements,
                    dependency_graph,
                )
                queue.extend(candidates)
                break
        else:
            stacks.append(pinned_requirements)

    return {'stacks': list(map(lambda s: [str(req) for req in s], stacks)), 'info': info}
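
The core of `recommend` is a work queue: resolve, check every pin against the knowledge base, and on a rejected pin enqueue a stricter requirement set and try again. The sketch below isolates that loop with a toy resolver in place of `pip-compile`. Everything in it (`AVAILABLE`, `OBSERVATIONS`, `toy_resolve`, `toy_recommend`, the `app` package) is made up for illustration, and unlike the notebook's code it only excludes the offending pin, not the packages that introduced it.

```python
from collections import deque

# Hypothetical toy index: available versions of a single package, newest first.
AVAILABLE = {"app": ["2.0", "1.1", "1.0"]}
# True = works, False = known to fail, missing = no observations.
OBSERVATIONS = {"app==2.0": False, "app==1.0": True}


def toy_resolve(excluded):
    """Stand-in for pip-compile: pick the newest version that is not excluded."""
    for version in AVAILABLE["app"]:
        if f"app=={version}" not in excluded:
            return [f"app=={version}"]
    raise LookupError("no version left to try")


def toy_recommend(stable):
    stacks, info = [], {}
    queue = deque([frozenset()])  # each item is a set of excluded pins
    while queue:
        excluded = queue.pop()
        try:
            pinned = toy_resolve(excluded)
        except LookupError:
            continue
        for pin in pinned:
            observation = OBSERVATIONS.get(pin)
            if observation is False or (observation is None and stable):
                # Reject this stack and retry with the pin excluded.
                info[pin] = "excluded"
                queue.append(excluded | {pin})
                break
            if observation is None:
                info[pin] = "no observations"
        else:
            # No pin was rejected, so the stack is accepted.
            stacks.append(pinned)
    return stacks, info


print(toy_recommend(stable=False))
# ([['app==1.1']], {'app==2.0': 'excluded', 'app==1.1': 'no observations'})
print(toy_recommend(stable=True))
# ([['app==1.0']], {'app==2.0': 'excluded', 'app==1.1': 'excluded'})
```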

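Let's resolve a software stack for our requirements. In this case we allow a potentially unstable environment: the recommendation type is `TESTING`. Package `tensorboard` in version `1.8.0` is excluded because of negative observations, and the resulting stack contains `tensorflow` in version `1.7.1`, for which the knowledge base holds no observations. `TESTING` keeps such a package and flags it with a warning.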

In [17]:
%%time

recommend(['flask>=1.0', 'tensorflow'], RecommendationType.TESTING)

INFO:__main__:New resolution run for requirements: ['tensorflow', 'flask>=1.0']
INFO:__main__:New resolution run for requirements: ['tensorflow', 'flask>=1.0', 'tensorflow!=1.8.0']
INFO:__main__:New resolution run for requirements: ['tensorflow', 'flask>=1.0', 'tensorboard!=1.8.0']
Tried: 1.0.0a3, 1.0.0a4, 1.0.0a5, 1.0.0a6, 1.6.0rc0, 1.6.0, 1.7.0, 1.8.0
There are incompatible versions in the resolved dependencies.



CPU times: user 3.67 s, sys: 42.4 ms, total: 3.72 s
Wall time: 3.93 s


{'info': {'tensorboard==1.8.0': 'Package excluded - negative observations found in the knowledge database',
 'stacks': [['absl-py==0.2.2',
   'astor==0.6.2',
   'bleach==1.5.0',
   'click==6.7',
   'flask==1.0.2',
   'gast==0.2.0',
   'grpcio==1.12.0',
   'html5lib==0.9999999',
   'itsdangerous==0.24',
   'jinja2==2.10',
   'markdown==2.6.11',
   'markupsafe==1.0',
   'numpy==1.14.3',
   'protobuf==3.5.2.post1',
   'six==1.11.0',
   'tensorboard==1.7.0',
   'tensorflow==1.7.1',
   'termcolor==1.1.0',
   'werkzeug==0.14.1',
   'wheel==0.31.1']]}
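
The follow-up runs in the log correspond to the candidate requirement sets built by `exclude_requirement`: one that forbids the offending pin and one for each package that introduced it. A string-only sketch of that step, assuming the introducers and their pinned versions are already known (`exclusion_candidates` is an illustrative helper, not part of the notebook's code):

```python
import typing


def exclusion_candidates(requirements: typing.List[str],
                         offending_pin: str,
                         introducer_pins: typing.List[str]) -> typing.List[typing.List[str]]:
    """Build one new requirement set per pin that should be ruled out."""
    candidates = []
    for pin in [offending_pin] + introducer_pins:
        name, version = pin.split('==', maxsplit=1)
        # Keep the original requirements and forbid exactly this version.
        candidates.append(requirements + [f"{name}!={version}"])
    return candidates


for candidate in exclusion_candidates(
    ['tensorflow', 'flask>=1.0'],
    'tensorboard==1.8.0',
    ['tensorflow==1.8.0'],
):
    print(candidate)
# ['tensorflow', 'flask>=1.0', 'tensorboard!=1.8.0']
# ['tensorflow', 'flask>=1.0', 'tensorflow!=1.8.0']
```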

Now let's assume we would like to have a stable environment: the recommendation type is `STABLE`. Package `tensorboard` in version `1.8.0` has negative observations, so new resolution rounds are started that exclude either it or `tensorflow` in version `1.8.0`, which introduced it. The next resolution suggests `tensorflow` in version `1.7.1`, for which we have no observations, so `STABLE` excludes it as well. The following resolution round thus falls back to `tensorflow` in version `1.7.0`, which gives a stable software stack based on our knowledge base:

In [18]:
%%time

recommend(['flask>=1.0', 'tensorflow'], RecommendationType.STABLE)

INFO:__main__:New resolution run for requirements: ['tensorflow', 'flask>=1.0']
INFO:__main__:New resolution run for requirements: ['tensorflow', 'flask>=1.0', 'tensorflow!=1.8.0']
INFO:__main__:New resolution run for requirements: ['tensorflow', 'flask>=1.0', 'tensorflow!=1.8.0', 'tensorflow!=1.7.1']
INFO:__main__:New resolution run for requirements: ['tensorflow', 'flask>=1.0', 'tensorboard!=1.8.0']
Tried: 1.0.0a3, 1.0.0a4, 1.0.0a5, 1.0.0a6, 1.6.0rc0, 1.6.0, 1.7.0, 1.8.0
There are incompatible versions in the resolved dependencies.



CPU times: user 5.19 s, sys: 26.1 ms, total: 5.22 s
Wall time: 5.26 s


{'info': {'tensorboard==1.8.0': 'Package excluded - negative observations found in the knowledge database',
  'tensorflow==1.7.1': 'Package excluded - no observations found in the knowledge database'},
 'stacks': [['absl-py==0.2.2',
   'astor==0.6.2',
   'bleach==1.5.0',
   'click==6.7',
   'flask==1.0.2',
   'gast==0.2.0',
   'grpcio==1.12.0',
   'html5lib==0.9999999',
   'itsdangerous==0.24',
   'jinja2==2.10',
   'markdown==2.6.11',
   'markupsafe==1.0',
   'numpy==1.14.3',
   'protobuf==3.5.2.post1',
   'six==1.11.0',
   'tensorboard==1.7.0',
   'tensorflow==1.7.0',
   'termcolor==1.1.0',
   'werkzeug==0.14.1',
   'wheel==0.31.1']]}

**TODO:**
 * mention that observations are per environment - they might differ on Fedora26, Fedora27, ...
 * mention that we need to inspect the software stack as a whole - how the given software stack works as a unit
   * as there is a large number of possible software stacks (given the packages and their versions), we could train a model for this and perform predictions
 * we will need to restrict versions on direct dependencies - we need to know which direct dependency introduced the given dependency