# Pipeline units in a software stack resolution process

| Info | Data |
| ------:| -----------:|
| **Author** | Fridolin Pokorny <fridolin@redhat.com> |
| **Date** | 27th Oct 2020 |
| **Last change** | 29th Oct 2020 |

![Resolution pipeline](https://github.com/thoth-station/adviser/raw/master/docs/source/_static/pipeline.gif?raw=true)

This Jupyter Notebook demonstrates pipeline units and pipeline configuration in [Thoth's adviser](https://github.com/thoth-station/adviser). The scenario shown resolves ``intel-tensorflow==2.0.1`` instead of ``tensorflow==2.1.0`` based on the pipeline configuration supplied to the resolution process. See the [online documentation of project Thoth](https://thoth-station.ninja/docs/developers/adviser/) for more info.

The notebook is intended for developers and for anyone interested in adviser internals. On the backend side these steps are done automatically, so actual Thoth users do not need to perform them.

See the [published TowardsDataScience article](https://towardsdatascience.com/how-to-beat-pythons-pip-software-stack-resolution-pipelines-21bc37f01a93) and a [YouTube video demonstrating this notebook](https://www.youtube.com/watch?v=OCX8JQDXP9s).

## Importing required bits and library versions

In [3]:
import yaml
import random
import sys
from pprint import pprint

from thoth.adviser import Resolver
from thoth.adviser import PipelineBuilder
from thoth.adviser import PipelineConfig
from thoth.adviser import RecommendationType
from thoth.adviser import __version__
from thoth.python import Project
from thoth.common import RuntimeEnvironment
from thoth.common import init_logging
from thoth.storages import GraphDatabase
import thoth.adviser.predictors as predictors

init_logging()
print("Adviser version: ", __version__)

2020-10-29 08:43:40,824 1403397 INFO     thoth.common:366: Logging to rsyslog endpoint is turned off


Adviser version:  0.19.0


## Project instantiation

We declare a dependency ``tensorflow==2.1.0`` which runs on Red Hat Enterprise Linux 8 (linux, x86_64). The notebook will use pre-aggregated knowledge stored and exposed locally. See [thoth-station/storages](https://github.com/thoth-station/storages) for more info on how to set up a local database instance.

In [4]:
PIPFILE = """
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]

[packages]
tensorflow = "==2.1.0"

[requires]
python_version = "3.6"
"""

# The runtime environment configuration can capture various parameters. We use just OS information, Python version and platform info. Information about hardware is unused.
runtime_environment = RuntimeEnvironment.from_dict({
    "hardware": {
        "cpu_family": None,
        "cpu_model": None
    },
    "operating_system": {
        "name": "rhel",
        "version": "8"
    },
    "python_version": "3.6",
    "cuda_version": None,
    "platform": "linux-x86_64"
})
project = Project.from_strings(PIPFILE, runtime_environment=runtime_environment)

## Pipeline configuration

Next, we will create a pipeline configuration we want to use during the software stack resolution process. We do this manually here to demonstrate how the pipeline works. The pipeline is constructed autonomously on the backend side and does not require any additional user interaction when running the recommendation engine.

In [5]:
_PIPELINE_CONF = """
boots:
- configuration:
    package_name: null
  name: PythonVersionBoot
- configuration:
    package_name: null
  name: RHELVersionBoot
- configuration:
    default_platform: linux-x86_64
  name: PlatformBoot
- configuration:
    package_name: null
  name: FullySpecifiedEnvironment
pseudonyms:
- configuration:
    package_name: tensorflow
    package_version: "2.1.0"
    index_url: "https://pypi.org/simple"
    aliases:
      - package_name: intel-tensorflow
        package_version: "2.1.0"
        index_url: "https://pypi.org/simple"
      - package_name: intel-tensorflow
        package_version: "2.0.1"
        index_url: "https://pypi.org/simple"
  name: AliasPseudonym
sieves:
- configuration:
    package_name: null
    without_error: true
  name: SolvedSieve
steps:
- configuration:
    package_name: "intel-tensorflow"
    package_version: "2.1.0"
    index_url: "https://pypi.org/simple"
    score: -0.2
  name: SetScoreStep
- configuration:
    package_name: "intel-tensorflow"
    package_version: "2.0.1"
    index_url: "https://pypi.org/simple"
    score: 1.0
  name: SetScoreStep
- configuration:
    package_name: "protobuf"
    package_version: "3.11.1"
    index_url: "https://pypi.org/simple"
    score: -0.5
  name: SetScoreStep
strides: []
wraps:
- configuration: {}
  name: MKLThreadsWrap
"""
                           

def get_pipeline_config() -> PipelineConfig:
    """Get pipeline configuration."""
    conf = yaml.safe_load(_PIPELINE_CONF)
    return PipelineBuilder.from_dict(conf)

pipeline_config = get_pipeline_config()

The pipeline configuration consists of [pipeline units of different types](https://thoth-station.ninja/docs/developers/adviser/unit.html), each serving a different purpose. When sorted alphabetically, the names of the pipeline unit types reflect their relative call order in the resolution process: [Boot](https://thoth-station.ninja/docs/developers/adviser/boots.html), [Pseudonym](https://thoth-station.ninja/docs/developers/adviser/pseudonyms.html), [Sieve](https://thoth-station.ninja/docs/developers/adviser/sieves.html), [Step](https://thoth-station.ninja/docs/developers/adviser/steps.html), [Stride](https://thoth-station.ninja/docs/developers/adviser/strides.html) and [Wrap](https://thoth-station.ninja/docs/developers/adviser/wraps.html). Follow the linked documentation if you are interested in the technical details. This notebook goes through the pipeline units present in the configuration and shows their role in the software stack resolution process.
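
The alphabetical-ordering property can be checked with a few lines of plain Python. This is only an illustration of the naming convention: the list is typed by hand and does not come from the adviser API.

```python
# Unit type names as listed in the text, deliberately shuffled.
unit_types = ["wrap", "step", "boot", "stride", "pseudonym", "sieve"]

# Sorting alphabetically yields the relative call order described in the text.
call_order = sorted(unit_types)
print(call_order)
```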

The first pipeline units, [boots](https://thoth-station.ninja/docs/developers/adviser/boots.html), are called prior to the actual resolution process to boot up the resolution pipeline. An example is the `FullySpecifiedEnvironment` boot, which checks that the supplied configuration states all the required properties of the runtime environment - the operating system, its version and the Python interpreter version:

In [6]:
yaml.safe_dump(pipeline_config.to_dict()["boots"][-1], sys.stdout)

configuration:
  package_name: null
name: FullySpecifiedEnvironment
unit_run: false
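
The kind of check such a boot performs can be sketched in plain Python. The helper `missing_environment_fields` below is hypothetical and not part of adviser; how the real unit reacts to missing values is not shown in this notebook, so the sketch only reports them.

```python
from typing import Any, Dict, List


def missing_environment_fields(runtime_environment: Dict[str, Any]) -> List[str]:
    """Return names of required runtime environment fields that are not set."""
    operating_system = runtime_environment.get("operating_system") or {}
    required = {
        "operating_system.name": operating_system.get("name"),
        "operating_system.version": operating_system.get("version"),
        "python_version": runtime_environment.get("python_version"),
    }
    # A field counts as missing when it is absent, None or an empty string.
    return [field for field, value in required.items() if not value]


complete = {"operating_system": {"name": "rhel", "version": "8"}, "python_version": "3.6"}
incomplete = {"operating_system": {"name": "rhel", "version": None}, "python_version": None}

print(missing_environment_fields(complete))
print(missing_environment_fields(incomplete))
```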


The next pipeline unit registered is called ``AliasPseudonym``. As the name suggests, it is a [pipeline unit of type pseudonym](https://thoth-station.ninja/docs/developers/adviser/pseudonyms.html), which treats packages as alternatives (pseudonyms) of each other. More specifically, it will consider the following two packages:

* intel-tensorflow in version 2.1.0 from PyPI
* intel-tensorflow in version 2.0.1 from PyPI

as pseudonyms of tensorflow 2.1.0 coming from PyPI. One can see this operation as replacing nodes in the dependency graph to generate alternatives: besides tensorflow==2.1.0 from PyPI, the dependency graph will also provide the two alternatives stated and will respect their dependencies (so the replacement is a valid operation with respect to Python packaging, not just a blind node replacement in the dependency graph). This operation can be done on transitive dependencies as well as on direct ones.

**Note:** Mind the differing minor and patch versions of the two ``intel-tensorflow`` packages.

In [4]:
yaml.safe_dump(pipeline_config.to_dict()["pseudonyms"], sys.stdout)

- configuration:
    aliases:
    - index_url: https://pypi.org/simple
      package_name: intel-tensorflow
      package_version: 2.1.0
    - index_url: https://pypi.org/simple
      package_name: intel-tensorflow
      package_version: 2.0.1
    index_url: https://pypi.org/simple
    package_name: tensorflow
    package_version: 2.1.0
  name: AliasPseudonym
  unit_run: false
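
The effect of a pseudonym on the set of candidates can be sketched as a simple lookup. The names `ALIASES` and `expand_with_pseudonyms` are hypothetical helpers for this illustration, not adviser API. The real unit also brings in the dependencies of each alternative, which this sketch leaves out.

```python
from typing import Dict, List, Tuple

PackageTuple = Tuple[str, str, str]  # (name, version, index_url)

PYPI = "https://pypi.org/simple"

# Alias table mirroring the AliasPseudonym configuration shown above.
ALIASES: Dict[PackageTuple, List[PackageTuple]] = {
    ("tensorflow", "2.1.0", PYPI): [
        ("intel-tensorflow", "2.1.0", PYPI),
        ("intel-tensorflow", "2.0.1", PYPI),
    ],
}


def expand_with_pseudonyms(package: PackageTuple) -> List[PackageTuple]:
    """Return the package itself followed by any configured alternatives."""
    return [package] + ALIASES.get(package, [])


candidates = expand_with_pseudonyms(("tensorflow", "2.1.0", PYPI))
for candidate in candidates:
    print(candidate)
```

A package without configured aliases is returned unchanged, so the expansion can be applied to direct and transitive dependencies alike.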


Let's move on to the next [pipeline unit, which is of type sieve](https://thoth-station.ninja/docs/developers/adviser/sieves.html). The main aim of this pipeline unit is to keep only dependencies that have been solved by [Thoth's solver](https://github.com/thoth-station/solver), meaning the dependency graph can be fully constructed and the resolution can lead to a valid software stack with respect to Python packaging rules (version range specifications). Moreover, this pipeline unit filters out all the packages that have installation errors in the target runtime environment. By doing so, we make sure the resolution pipeline produces software stacks that do not fail during application assembly in the target environment.

In [5]:
yaml.safe_dump(pipeline_config.to_dict()["sieves"], sys.stdout)

- configuration:
    package_name: null
    without_error: true
  name: SolvedSieve
  unit_run: false
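
A sieve can be thought of as a filter over candidate packages. The sketch below is a simplified stand-in, assuming a hypothetical in-memory table `SOLVER_RESULTS` with made-up package names; the real unit queries Thoth's knowledge graph instead.

```python
from typing import Dict, Iterable, Iterator, Optional

# Hypothetical solver records with placeholder package names:
# False = solved without error, True = solved with an installation error,
# None = not solved at all.
SOLVER_RESULTS: Dict[str, Optional[bool]] = {
    "package-a==1.0.0": False,
    "package-b==2.0.0": True,
    "package-c==3.0.0": None,
}


def solved_sieve(candidates: Iterable[str], without_error: bool = True) -> Iterator[str]:
    """Yield only candidates that are solved (and, optionally, install cleanly)."""
    for candidate in candidates:
        has_error = SOLVER_RESULTS.get(candidate)
        if has_error is None:
            continue  # Not solved: the dependency graph cannot be constructed.
        if without_error and has_error:
            continue  # Solved, but installation fails in the target environment.
        yield candidate


kept = list(solved_sieve(SOLVER_RESULTS))
print(kept)
```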


Now, let's move on to [pipeline units of type step](https://thoth-station.ninja/docs/developers/adviser/steps.html). These pipeline units were primarily designed to score software packages and thus tell the resolution process how good a resolved software stack is. The scoring can consider various aspects of the software stack, for example known vulnerabilities of packages or performance aspects of the resolved stack.

For simplicity, we assign scores to the packages explicitly without any semantics. The three pipeline units registered will make sure:

* intel-tensorflow in version 2.1.0 from PyPI will be scored -0.2 (negative score)
* intel-tensorflow in version 2.0.1 from PyPI will be scored 1.0 (high positive score)
* protobuf in version 3.11.1 from PyPI will be scored -0.5 (negative score)

The resolver will use these "observations" to come up with the best possible software stack respecting the score assigned.

In [6]:
yaml.safe_dump(pipeline_config.to_dict()["steps"], sys.stdout)

- configuration:
    index_url: https://pypi.org/simple
    multi_package_resolution: false
    package_name: intel-tensorflow
    package_version: 2.1.0
    score: -0.2
  name: SetScoreStep
  unit_run: false
- configuration:
    index_url: https://pypi.org/simple
    multi_package_resolution: false
    package_name: intel-tensorflow
    package_version: 2.0.1
    score: 1.0
  name: SetScoreStep
  unit_run: false
- configuration:
    index_url: https://pypi.org/simple
    multi_package_resolution: false
    package_name: protobuf
    package_version: 3.11.1
    score: -0.5
  name: SetScoreStep
  unit_run: false
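
How such per-package scores can rank whole stacks is sketched below. It assumes that the score of a stack is the sum of the scores of its packages, and it uses three hand-written candidate stacks; the ``protobuf`` version 3.13.0 is a placeholder for "some other, unscored version". The names `SCORES` and `stack_score` are hypothetical and not adviser API.

```python
from typing import Dict, List, Tuple

PackageTuple = Tuple[str, str]  # (name, version); index URL omitted for brevity

# Scores taken from the three SetScoreStep entries above.
SCORES: Dict[PackageTuple, float] = {
    ("intel-tensorflow", "2.1.0"): -0.2,
    ("intel-tensorflow", "2.0.1"): 1.0,
    ("protobuf", "3.11.1"): -0.5,
}


def stack_score(stack: List[PackageTuple]) -> float:
    """Sum the scores of all packages in a stack; unscored packages add 0.0."""
    return sum(SCORES.get(package, 0.0) for package in stack)


# Hypothetical candidate stacks, reduced to two packages each.
stacks = [
    [("tensorflow", "2.1.0"), ("protobuf", "3.11.1")],
    [("intel-tensorflow", "2.1.0"), ("protobuf", "3.13.0")],
    [("intel-tensorflow", "2.0.1"), ("protobuf", "3.13.0")],
]

best = max(stacks, key=stack_score)
print(best, stack_score(best))
```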


One of the last pipeline units stated is a [pipeline unit of type wrap](https://thoth-station.ninja/docs/developers/adviser/wraps.html). The wrap registered, called `MKLThreadsWrap`, will show a message to the user about MKL configuration and will suggest changes to the manifests used to deploy the application (see below):

In [7]:
yaml.safe_dump(pipeline_config.to_dict()["wraps"], sys.stdout)

- configuration:
    package_name: null
  name: MKLThreadsWrap
  unit_run: false
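
Conceptually, a wrap post-processes an already resolved stack and attaches extra information to it. The function below is a hypothetical sketch of that idea, not the adviser implementation; in particular, the rule that only ``intel-tensorflow`` triggers the hint is an assumption made for this example.

```python
from typing import Any, Dict, List


def mkl_threads_wrap(stack: List[str], product: Dict[str, Any]) -> None:
    """Attach an MKL hint to the product if an MKL-based package is in the stack."""
    # Assumption: the set of packages that trigger the hint is illustrative only.
    if "intel-tensorflow" not in stack:
        return
    product.setdefault("justification", []).append(
        {
            "link": "https://thoth-station.ninja/j/mkl_threads",
            "message": "Consider adjusting OMP_NUM_THREADS environment variable "
            "for containerized deployments",
        }
    )


with_mkl: Dict[str, Any] = {}
mkl_threads_wrap(["intel-tensorflow", "protobuf"], with_mkl)

without_mkl: Dict[str, Any] = {}
mkl_threads_wrap(["tensorflow", "protobuf"], without_mkl)

print(with_mkl)
print(without_mkl)
```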


![State space](https://thoth-station.ninja/docs/developers/adviser/images/state_space_interpolated.png)

## Resolution process

Let's proceed to the resolution process. We will use the "[Approximating latest](https://thoth-station.ninja/docs/developers/adviser/predictors/latest.html)" predictor, which tries to come up with the most recent packages in the stack, considering their versioning.

In [8]:
%%time

predictor = predictors.ApproximatingLatest(keep_history=False)
resolver = Resolver.get_adviser_instance(
    predictor=predictor,
    project=project,
    recommendation_type=RecommendationType.LATEST,  # Use "latest" recommendation type, has no effect in pipeline units used.
    limit=10000,  # Limit number of software stacks scored.
    count=1,  # We want just one software stack to be shown in the final report.
    beam_width=None,  # No limitation in memory consumption for internal resolver states.
    pipeline_config=pipeline_config,
)

2020-10-27 23:36:22,885 1354697 INFO     alembic.runtime.migration:155: Context impl PostgresqlImpl.
2020-10-27 23:36:22,886 1354697 INFO     alembic.runtime.migration:162: Will assume transactional DDL.


CPU times: user 199 ms, sys: 11.8 ms, total: 211 ms
Wall time: 220 ms
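
The roles of the `limit` and `count` parameters can be illustrated with a small, self-contained sketch: at most `limit` stacks are scored and the `count` best of them are reported. The helper `pick_best` and the stack names are hypothetical; the real resolver explores the state space under the guidance of a predictor, which this sketch does not attempt to model.

```python
import heapq
import itertools
from typing import Iterable, List, Tuple

ScoredStack = Tuple[float, str]  # (score, stack identifier)


def pick_best(scored_stacks: Iterable[ScoredStack], limit: int, count: int) -> List[ScoredStack]:
    """Look at no more than `limit` stacks and return the `count` best ones."""
    considered = itertools.islice(scored_stacks, limit)
    return heapq.nlargest(count, considered)


example_stacks = [(-0.2, "stack-a"), (1.0, "stack-b"), (0.0, "stack-c")]

print(pick_best(example_stacks, limit=10000, count=1))
print(pick_best(example_stacks, limit=1, count=1))
```

With a generous `limit` the best-scored stack is found; with `limit=1` the first stack reached is reported regardless of its score.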


In [9]:
%%time

random.seed(30)  # Set seed to have reproducible results across runs.
resolver.graph.cache_clear()  # Clear the cache so it does not affect speed in multiple invocations.
report = resolver.resolve(with_devel=False, user_stack_scoring=False)

2020-10-27 23:36:22,945 1354697 INFO     thoth.adviser.resolver:1083: No scoring done on user's stack - see https://thoth-station.ninja/j/user_stack
2020-10-27 23:36:22,946 1354697 INFO     thoth.adviser.resolver:1085: Preparing initial states for the resolution pipeline
2020-10-27 23:36:22,948 1354697 INFO     thoth.adviser.resolver:618: Resolving direct dependencies
2020-10-27 23:36:23,131 1354697 INFO     thoth.adviser.resolver:653: Found direct dependency 'tensorflow' with version specification '==2.1.0'
2020-10-27 23:36:23,135 1354697 INFO     thoth.adviser.resolver:1089: Hold tight, Thoth is computing recommendations for your application...
2020-10-27 23:36:28,820 1354697 INFO     thoth.adviser.resolver:1196: Pipeline reached 1 final states out of 10000 requested in iteration 342 (pipeline pace 0.17 stacks/second); top rated software stack in beam has a score of 1.00; top rated software stack found so far has a score of -0.20
2020-10-27 23:36:37,283 1354697 INFO     thoth.adviser

CPU times: user 14 s, sys: 335 ms, total: 14.3 s
Wall time: 19.7 s


Results shown below demonstrate that the resolution process found ``intel-tensorflow==2.0.1`` as an alternative to ``tensorflow==2.1.0``, which was originally stated in the requirements file (Pipfile). Moreover, the resolved software stack does not include the version of ``protobuf`` that was penalized and would affect the application stack negatively. Both observations are consistent with the pipeline configuration we provided.

The `stack_info` part of the report shows which packages were not considered during the resolution process as they would produce application assembling issues (they cannot be installed into the given runtime environment).

In [10]:
yaml.safe_dump(report.to_dict(), sys.stdout, sort_keys=True, indent=2)

accepted_final_states_count: 10000
discarded_final_states_count: 0
pipeline:
  boots:
  - configuration:
      package_name: null
    name: PythonVersionBoot
    unit_run: true
  - configuration:
      package_name: null
    name: RHELVersionBoot
    unit_run: true
  - configuration:
      default_platform: linux-x86_64
    name: PlatformBoot
    unit_run: true
  - configuration:
      package_name: null
    name: FullySpecifiedEnvironment
    unit_run: true
  pseudonyms:
  - configuration:
      aliases:
      - index_url: https://pypi.org/simple
        package_name: intel-tensorflow
        package_version: 2.1.0
      - index_url: https://pypi.org/simple
        package_name: intel-tensorflow
        package_version: 2.0.1
      index_url: https://pypi.org/simple
      package_name: tensorflow
      package_version: 2.1.0
    name: AliasPseudonym
    unit_run: true
  sieves:
  - configuration:
      package_name: null
      without_error: true
    name: SolvedSieve
    unit_run: tr

Note also [manifest changes suggested](https://thoth-station.ninja/docs/developers/adviser/manifest_changes.html) by the recommendation engine (by the `MKLThreadsWrap`):

In [11]:
yaml.safe_dump(report.to_dict()["products"][0]["advised_manifest_changes"], sys.stdout, sort_keys=True, indent=2)

- - 'apiVersion:': apps.openshift.io/v1
    kind: DeploymentConfig
    patch:
      op: add
      path: /spec/template/spec/containers/0/env/0
      value:
        name: OMP_NUM_THREADS
        value: '1'
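
The `patch` entry has the shape of a JSON Patch "add" operation. A minimal sketch of how such an operation could be applied to a deployment manifest is shown below. The helper `apply_add_patch` and the sample manifest are assumptions made for illustration; the sketch handles only plain keys and numeric list indices and skips JSON Pointer escaping and the `-` index.

```python
import copy
from typing import Any, Dict


def apply_add_patch(manifest: Dict[str, Any], path: str, value: Any) -> Dict[str, Any]:
    """Apply a single "add" operation and return a patched copy of the manifest."""
    patched = copy.deepcopy(manifest)
    *parents, last = path.lstrip("/").split("/")
    target: Any = patched
    for key in parents:
        # Lists are addressed by numeric index, dictionaries by key.
        target = target[int(key)] if isinstance(target, list) else target[key]
    if isinstance(target, list):
        target.insert(int(last), value)  # "add" on a list inserts before the index.
    else:
        target[last] = value
    return patched


# Hypothetical, heavily trimmed manifest with one container and no env entries.
manifest = {"spec": {"template": {"spec": {"containers": [{"name": "app", "env": []}]}}}}

patched = apply_add_patch(
    manifest,
    "/spec/template/spec/containers/0/env/0",
    {"name": "OMP_NUM_THREADS", "value": "1"},
)
print(patched["spec"]["template"]["spec"]["containers"][0]["env"])
```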


... and additional information to the user:

In [12]:
yaml.safe_dump(report.to_dict()["products"][0]["justification"], sys.stdout, sort_keys=True, indent=2)

- link: https://thoth-station.ninja/j/mkl_threads
  message: Consider adjusting OMP_NUM_THREADS environment variable for containerized
    deployments, one or more libraries use Intel's MKL that does not detect correctly
    resource allocation in the cluster
- link: https://thoth-station.ninja/j/mkl_libs
  message: Make sure your environment has proper Intel Performance Libraries when
    using Intel TensorFlow builds


To compare with the results obtained above, let's trigger another resolution process, this time without providing any ``intel-tensorflow`` packages as pseudonyms and without performing any package scoring. The resolved software stack will hold ``tensorflow==2.1.0`` as required by the application (respecting the Pipfile) and a more recent ``protobuf`` that is not penalized.

In [13]:
_PIPELINE_CONF = """
boots:
- configuration:
    package_name: null
  name: PythonVersionBoot
- configuration:
    package_name: null
  name: RHELVersionBoot
- configuration:
    default_platform: linux-x86_64
  name: PlatformBoot
- configuration:
    package_name: null
  name: FullySpecifiedEnvironment
pseudonyms: []
sieves:
- configuration:
    package_name: null
    without_error: true
  name: SolvedSieve
steps: []
strides: []
wraps: []
"""
                           

def get_pipeline_config() -> PipelineConfig:
    """Get pipeline configuration."""
    conf = yaml.safe_load(_PIPELINE_CONF)
    return PipelineBuilder.from_dict(conf)

pipeline_config = get_pipeline_config()

In [14]:
%%time

predictor = predictors.ApproximatingLatest(keep_history=False)
resolver = Resolver.get_adviser_instance(
    predictor=predictor,
    project=project,
    recommendation_type=RecommendationType.LATEST,  # Use "latest" recommendation type, has no effect in pipeline units used.
    limit=1,  # Limit number of software stacks scored.
    count=1,  # We want just one software stack to be shown in the final report.
    beam_width=None,  # No limitation in memory consumption for internal resolver states.
    pipeline_config=pipeline_config,
)

2020-10-27 23:36:43,287 1354697 INFO     alembic.runtime.migration:155: Context impl PostgresqlImpl.
2020-10-27 23:36:43,288 1354697 INFO     alembic.runtime.migration:162: Will assume transactional DDL.


CPU times: user 81.3 ms, sys: 4 ms, total: 85.3 ms
Wall time: 95.4 ms


In [15]:
%%time

random.seed(30)  # Set seed to have reproducible results across runs.
resolver.graph.cache_clear()  # Clear the cache so it does not affect speed in multiple invocations.
report = resolver.resolve(with_devel=False, user_stack_scoring=False)

2020-10-27 23:36:43,385 1354697 INFO     thoth.adviser.resolver:1083: No scoring done on user's stack - see https://thoth-station.ninja/j/user_stack
2020-10-27 23:36:43,386 1354697 INFO     thoth.adviser.resolver:1085: Preparing initial states for the resolution pipeline
2020-10-27 23:36:43,387 1354697 INFO     thoth.adviser.resolver:618: Resolving direct dependencies
2020-10-27 23:36:43,409 1354697 INFO     thoth.adviser.resolver:653: Found direct dependency 'tensorflow' with version specification '==2.1.0'
2020-10-27 23:36:43,416 1354697 INFO     thoth.adviser.resolver:1089: Hold tight, Thoth is computing recommendations for your application...
2020-10-27 23:36:50,638 1354697 INFO     thoth.adviser.resolver:1196: Pipeline reached 1 final states out of 1 requested in iteration 4925 (pipeline pace 0.14 stacks/second); top rated software stack in beam has a score of 0.00; top rated software stack found so far has a score of 0.00
2020-10-27 23:36:51,073 1354697 INFO     thoth.adviser.res

CPU times: user 5.19 s, sys: 174 ms, total: 5.36 s
Wall time: 7.7 s


In [16]:
yaml.safe_dump(report.to_dict(), sys.stdout, sort_keys=True, indent=2)

accepted_final_states_count: 1
discarded_final_states_count: 0
pipeline:
  boots:
  - configuration:
      package_name: null
    name: PythonVersionBoot
    unit_run: true
  - configuration:
      package_name: null
    name: RHELVersionBoot
    unit_run: true
  - configuration:
      default_platform: linux-x86_64
    name: PlatformBoot
    unit_run: true
  - configuration:
      package_name: null
    name: FullySpecifiedEnvironment
    unit_run: true
  pseudonyms: []
  sieves:
  - configuration:
      package_name: null
      without_error: true
    name: SolvedSieve
    unit_run: true
  steps: []
  strides: []
  wraps: []
products:
- advised_manifest_changes: []
  advised_runtime_environment: null
  justification: []
  project:
    requirements:
      dev-packages: {}
      packages:
        tensorflow: ==2.1.0
      requires: &id001
        python_version: '3.6'
      source:
      - name: pypi
        url: https://pypi.org/simple
        verify_ssl: true
      - name: pypi-org
 