# Thoth 0.5.0 - Example 3 Guided Notebook

This notebook is supporting material for example 3 of the Thoth 0.5.0 release.

See the internal document in [Google Docs](https://docs.google.com/document/d/1QflQpGXtOuHFFC2hkEFBlu0JmCQWEnXdgryvwxCNXpQ/edit#), section "Example 3", for more information and clarification.

## Initial graph database setup

To go through this scenario, we first need to connect to a graph database instance. This notebook is playable from your own computer. It inserts all the data into the JanusGraph instance you provide, so select the graph database instance you would like to use. If you want to run this notebook purely on your local machine, set up a local graph database as [described in the README file of the thoth-station/janusgraph-thoth-config repo](https://github.com/thoth-station/janusgraph-thoth-config#running-janusgraph-instance-locally). Ideally, just clone the repo and issue the following command to set up your local JanusGraph database instance:

```
sudo ./local.sh all
```

In [None]:
# Configure JanusGraph instance to talk to:
JANUSGRAPH_SERVICE_HOST = 'localhost'

# For directly talking to test environment, uncomment the following line:
# JANUSGRAPH_SERVICE_HOST = 'janusgraph.test.thoth-station.ninja'

Now let's connect to the desired JanusGraph database and check that we are properly connected:

In [None]:
from thoth.storages import GraphDatabase

# Instantiate and connect the JanusGraph database.
graph = GraphDatabase.create(JANUSGRAPH_SERVICE_HOST)
graph.connect()

graph.is_connected()

In the next step we download the result of a [thoth-solver](https://github.com/thoth-station/solver) run which resolved all the required stacks (Flask, PyYAML and all their transitive dependencies) as known at the time of that run.

In [None]:
import requests
from thoth.common import timestamp2datetime

SOLVER_DOCUMENT_URL = 'https://raw.githubusercontent.com/thoth-station/misc/master/examples/scoring/resolved.json'

response = requests.get(SOLVER_DOCUMENT_URL)
response.raise_for_status()
solver_document = response.json()

print("Document covers all transitive packages which can be installed when %r is installed (no version specifier, any version)." % solver_document["metadata"]["arguments"]["pypi"]["requirements"])
print("Stacks were resolved at", timestamp2datetime(solver_document["metadata"]["timestamp"]))

This document lists packages and their resolved dependencies. The data are used to construct entries in the graph database, which Thoth's resolver then uses to construct software stacks. Note that stacks are constructed offline, without installing any packages (installing Python packages is slow), which allows stacks to be generated at a high rate. The solver document also captures details of the environment in which the solver was run (see the other Thoth notebooks for an explanation of why this is needed).

In [None]:
from thoth.storages import SolverResultsStore

document_id = SolverResultsStore.get_document_id(solver_document)
solver_name = SolverResultsStore.get_solver_name_from_document_id(document_id)
graph.parse_python_solver_name(solver_name)

Let's sync the solver document into the graph database so that we can use these dependency graphs later on:

In [None]:
%%time

graph.sync_solver_result(solver_document)

## Stack generation and scoring

In this notebook we will show how a software stack is scored and how Thoth finds an optimal stack for you.

Imagine a stack made of two libraries, `simplelib` and `anotherlib`. Together they influence how the software that uses them behaves. The figure below shows the scores a scoring function assigned to different combinations of versions of these two libraries.

![alt text](https://raw.githubusercontent.com/thoth-station/misc/master/fig/score_3d_raw.png "Scoring function visualization")

As the image above is not easily readable, let's interpolate the results. This way we can see the surface that the scoring function forms across different versions of `simplelib` and `anotherlib`.

![alt text](https://raw.githubusercontent.com/thoth-station/misc/master/fig/score_3d.png "Scoring function visualization")

Having covered the theoretical example above, let's demo this using Thoth. Our scoring function will be "how many CVEs are present in a software stack". Even though CVEs are "low-hanging fruit", they nicely illustrate the approach Thoth takes. In the sections below, we will show how this approach can be extended with additional vectors in the scoring function, such as the performance characteristics of a software stack. From now on, let's imagine `simplelib` is Flask and `anotherlib` is PyYAML. Let's assume the other dependencies (the transitive dependencies of Flask and PyYAML) do not have any CVEs, so we can project stacks into a 3D space as shown above (if another library had a CVE, we would need to add a new dimension).
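
The idea of a scoring function over a two-dimensional version grid can be sketched in a few lines of plain Python. This is only an illustration and not Thoth's implementation: the version strings, the CVE counts and the `score_stack` helper are all made up for the example.

```python
from itertools import product

# Hypothetical data for illustration only: each version is mapped to the
# number of CVEs assumed to affect it. These are not real advisories.
SIMPLELIB_CVES = {"1.0": 2, "1.1": 1, "1.2": 0}
ANOTHERLIB_CVES = {"0.9": 0, "1.0": 1}


def score_stack(simplelib_version: str, anotherlib_version: str) -> int:
    """Return the number of CVEs present in the stack (lower is better)."""
    return SIMPLELIB_CVES[simplelib_version] + ANOTHERLIB_CVES[anotherlib_version]


# Score every point of the version grid; the score is the third dimension.
scores = {
    (simplelib, anotherlib): score_stack(simplelib, anotherlib)
    for simplelib, anotherlib in product(SIMPLELIB_CVES, ANOTHERLIB_CVES)
}

# The optimal stack is the grid point with the lowest score.
best_stack = min(scores, key=scores.get)
print(best_stack, scores[best_stack])
```

Each additional library with CVEs would add one more axis to the grid, which is why the number of candidate stacks grows so quickly.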

To demo the use case, let's create an application which uses two Python libraries, [Flask](https://pypi.org/project/Flask/) and [PyYAML](https://pypi.org/project/PyYAML/). As Thoth has a resolver implemented on top of the graph database, it can generate software stacks at a high rate (see for example the [`libdependency_graph.so` library which Thoth uses under the hood](https://github.com/thoth-station/adviser/blob/master/docs/libdependency_graph.md)) and score them based on observations.

In one of the cells above, we fed [thoth-solver](https://github.com/thoth-station/solver/) results into the graph database, so we know all the transitive dependencies of the Python packages the user's application uses.

Next, let's insert some CVE related information.

Thoth, as of now, uses [pyup.io](https://pyup.io/)'s safety database, which curates CVEs in the Python ecosystem. In a deployment, this database is synced into Thoth's knowledge base by Thoth's [cve-update-job](https://github.com/thoth-station/cve-update-job) component. As we work in a Jupyter Notebook, possibly talking to a graph database instance that has not been initialized, let's manually insert some CVEs into our database:

In [None]:
# Download pyup.io safety database:
import requests

SAFETY_DB_URL = "https://raw.githubusercontent.com/pyupio/safety-db/master/data/insecure_full.json"

response = requests.get(SAFETY_DB_URL)
response.raise_for_status()
cve_database = response.json()

# We will use Thoth's resolver, implemented on top of the graph database,
# to resolve the packages affected by the CVEs we are interested in for this demo:
from thoth.python import PackageVersion
from thoth.adviser.python.solver import PythonPackageGraphSolver

# Instantiate Thoth's graph solver:
graph_solver = PythonPackageGraphSolver(graph_db=graph)

for cve in cve_database.get("flask", []):
    print("---> Syncing Flask CVE record into graph database, cve: %r (CVE id: %r)" % (cve["id"], cve["cve"]))
    print("CVE %r is affecting Flask versions %r" % (cve["id"], cve["v"]))
    
    # Resolve the affected Flask versions:
    versions = graph_solver.solve([PackageVersion(name="flask", version=cve["v"], develop=False)], all_versions=True)
    for package_version in versions['flask']:
        print("\tCreating record for affected Flask version: %r" % (package_version.locked_version))
        graph.create_python_cve_record(
            package_name=package_version.name,
            package_version=package_version.locked_version,
            index_url="https://pypi.org/simple",  # This is the index monitored by pyup.io safety db.
            record_id=cve["id"],
            version_range=cve["v"],
            advisory=cve["advisory"],
            cve=cve["cve"],
        )

There is also a CVE, [CVE-2017-18342](https://nvd.nist.gov/vuln/detail/CVE-2017-18342), which affects recent versions of PyYAML (versions lower than 4.1 and, for our use case, higher than 3.05). The fix is also present in pre-releases of PyYAML, and [the most recent version of PyYAML to date, 3.13,](https://pypi.org/project/PyYAML/#history) is also vulnerable to this CVE.

The pyup.io database does not list this CVE (probably because it is too recent), so for now let's insert it into the graph database manually:
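
To make the version range concrete, here is a minimal sketch of how a range such as `<4.1,>3.05` can be evaluated against a version. It is a simplification written for this notebook: it handles only plain numeric versions and basic comparison operators, and it is not the resolver Thoth uses, nor a full PEP 440 implementation. The helper names `parse_version`, `compare_padded` and `is_affected` are hypothetical.

```python
import operator
import re

# Two-character operators come first so "<=" is not read as "<".
_OPERATORS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
}


def parse_version(version: str) -> tuple:
    """Turn '3.05' into (3, 5); only plain numeric versions are supported."""
    return tuple(int(part) for part in version.split("."))


def compare_padded(op, left: tuple, right: tuple) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    width = max(len(left), len(right))
    left += (0,) * (width - len(left))
    right += (0,) * (width - len(right))
    return op(left, right)


def is_affected(version: str, version_range: str) -> bool:
    """Return True if the version satisfies every clause of the range."""
    candidate = parse_version(version)
    for clause in version_range.split(","):
        match = re.fullmatch(r"\s*(<=|>=|==|!=|<|>)\s*([\d.]+)\s*", clause)
        if match is None:
            raise ValueError(f"Unsupported clause: {clause!r}")
        op = _OPERATORS[match.group(1)]
        if not compare_padded(op, candidate, parse_version(match.group(2))):
            return False
    return True


for version in ("3.05", "3.13", "4.1"):
    print(version, is_affected(version, "<4.1,>3.05"))
```

With this range, 3.13 is reported as affected, while 3.05 and 4.1 fall outside it because both bounds are exclusive.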

In [None]:
print("---> Syncing PyYAML CVE record into graph database, cve: CVE-2017-18342")

# Resolve the affected PyYAML versions:
versions = graph_solver.solve([PackageVersion(name="pyyaml", version="<4.1,>3.05", develop=False)], all_versions=True)
for package_version in versions['pyyaml']:
    print("\tCreating record for affected PyYAML version: %r" % (package_version.locked_version))
    graph.create_python_cve_record(
        package_name=package_version.name,
        package_version=package_version.locked_version,
        index_url="https://pypi.org/simple",
        record_id="CVE-2017-18342",
        version_range="<4.1,>3.05",
        advisory="In PyYAML before 4.1, the yaml.load() API could execute arbitrary code. In other words, yaml.safe_load is not used.",
        cve="CVE-2017-18342",
    )

We have fed all the data necessary for this notebook into our graph database instance. Let's have a look at the user's stack by inspecting `Pipfile` and `Pipfile.lock`:

In [None]:
PIPFILE_URL = "https://raw.githubusercontent.com/thoth-station/thamos/master/examples/scoring/Pipfile"
PIPFILE_LOCK_URL = "https://raw.githubusercontent.com/thoth-station/thamos/master/examples/scoring/Pipfile.lock"

response = requests.get(PIPFILE_URL)
response.raise_for_status()
pipfile_str = response.text

response = requests.get(PIPFILE_LOCK_URL)
response.raise_for_status()

pipfile_lock_str = response.text

Direct dependencies that the user uses in their application, together with the configured index:

In [None]:
print(pipfile_str)

And the corresponding lockfile for the user's stack:

In [None]:
print(pipfile_lock_str)

Thoth internally operates on a "Project" abstraction, so let's instantiate one:

In [None]:
from thoth.python import Project

project = Project.from_strings(pipfile_str, pipfile_lock_str)
project.to_dict()

Let's have a look at which versions of the PyYAML and Flask libraries are used:

In [None]:
print("Flask is used in version %r " % project.get_locked_package_version("flask").locked_version)

In [None]:
print("PyYAML is used in version %r " % project.get_locked_package_version("pyyaml").locked_version)

As you can see, known CVEs affect the locked versions of both Flask and PyYAML. If you took a closer look at the stacks and manually evaluated the version ranges of the CVEs for the affected packages, you would come up with a solution that gives a CVE-free software stack. The solution would be:

* Update Flask to version >0.12.2
* Downgrade PyYAML to a lower version without the CVE

Or:

* Update Flask to version >0.12.2
* Use a pre-release of PyYAML to get rid of the PyYAML CVE vulnerability

As the user did not enable pre-releases (the `allow_prereleases` configuration option in `Pipfile`, or running `pipenv install --pre`; see the `Pipfile` at the end of this notebook for an example), the second option is not available for this stack.

Now we can let Thoth compute the best possible software stack. You can adjust the configuration options for `Adviser`. Without any limitations, there are possibly 46,997,280 different stacks (an upper bound, estimated from the number of packages and all the combinations they can create). However, you can adjust the parameters to lower this number (e.g. by reducing the number of recent versions considered for each package with `limit_latest_versions`).
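
The arithmetic behind such an upper bound is a simple product of the number of candidate versions per package. The sketch below uses made-up release counts (they are not the ones behind the figure above), and `upper_bound` is a hypothetical helper, not part of Thoth.

```python
from math import prod

# Hypothetical number of known releases for each package in the stack.
release_counts = {"flask": 20, "pyyaml": 15, "jinja2": 30, "werkzeug": 25}


def upper_bound(counts, limit_latest_versions=None):
    """Multiply per-package candidate counts to bound the number of stacks."""
    if limit_latest_versions is not None:
        # Keep at most N candidate versions for every package.
        counts = {
            name: min(count, limit_latest_versions)
            for name, count in counts.items()
        }
    return prod(counts.values())


print(upper_bound(release_counts))                           # all releases
print(upper_bound(release_counts, limit_latest_versions=4))  # at most 4 each
```

The product is only an upper bound, because version constraints between packages rule out many of the combinations.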

Let's give adviser a try:

In [None]:
%%time
%env THOTH_ADVISER_SHOW_PACKAGES=1

from thoth.adviser.python import Adviser
from thoth.adviser.enums import RecommendationType

stack_info, report = Adviser.compute_on_project(
    project,
    recommendation_type=RecommendationType.STABLE,
    count=5,                   # Number of best stacks reported in the output.
    limit=None,                # Limit number of stacks scored in total.
    limit_latest_versions=4,   # Consider only the four latest versions of each package.
    dry_run=False,
    graph=graph,
)

Parameter `count` limits the number of stacks provided in the output, while parameter `limit` limits the number of stacks scored in total.

The step above did not consider any runtime environment (the user did not configure one in Thoth's configuration file). This means Thoth assumes "any" runtime environment, which is also reported back to the user as a recommendation.
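
The difference between the two parameters can be illustrated with a generic sketch: `limit` caps how many candidate stacks are scored at all, and `count` caps how many of the scored stacks are reported. This is not how `Adviser` is implemented; `best_stacks`, `cve_count` and the package names are hypothetical, and a lower score is assumed to be better.

```python
import heapq
from itertools import islice, product


def best_stacks(candidates, score, count, limit=None):
    """Score at most `limit` stacks and return the `count` lowest-scoring ones."""
    considered = islice(candidates, limit)  # limit=None means "no cap"
    return heapq.nsmallest(count, considered, key=score)


# Hypothetical stacks: every pairing of three versions of two packages.
stacks = product(["a-1", "a-2", "a-3"], ["b-1", "b-2", "b-3"])


def cve_count(stack):
    """Hypothetical score: pretend 'a-1' and 'b-3' carry one CVE each."""
    return sum(package in {"a-1", "b-3"} for package in stack)


# Only the first 6 of the 9 stacks are scored; the 2 best are reported.
top = best_stacks(stacks, cve_count, count=2, limit=6)
print(top)
```

A small `limit` keeps the run short, but it can hide better stacks that were never scored; in this sketch no stack containing `a-3` is ever considered.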

In [None]:
stack_info

Also, take a look at the most important part: the generated report with stacks and guidance/justifications:

In [None]:
report

Most tools out there expect a fix for a CVE to be available in a more recent version of a package, which is obviously not always true. As Thoth is a proactive system (we use bots), it can directly recommend *downgrading* a package when a new vulnerability appears.

The scenario above showed how Thoth works purely on the package level: we considered how different packages affect the "security" of a software stack (how many CVEs are present in it). This way we have provided "package-level guidance" for stacks. The previous scenario, scenario 2 (`runtime-environment`), showed guidance on the software stack level, where we have information on how well a group of packages works together (performance-related advice). The current implementation of Thoth combines these two input vectors, security guidance and performance guidance, in the `STABLE` scoring function.
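
Why a downgrade can be the right recommendation is easy to see in a toy example. The release history and the set of affected releases below are invented for illustration, and `newest_unaffected` is a hypothetical helper rather than a Thoth API.

```python
# Hypothetical release history as (major, minor) tuples, oldest first.
releases = [(3, 4), (3, 5), (3, 10), (3, 12), (3, 13)]

# Hypothetical set of releases flagged by an advisory; no fix is released yet.
affected = {(3, 10), (3, 12), (3, 13)}

# The version currently locked in the user's stack.
current = (3, 13)


def newest_unaffected(releases, affected):
    """Return the highest release not flagged by the advisory, or None."""
    safe = [release for release in releases if release not in affected]
    return max(safe) if safe else None


recommended = newest_unaffected(releases, affected)
if recommended is None:
    action = "no safe release known"
elif recommended < current:
    action = "downgrade"
elif recommended > current:
    action = "upgrade"
else:
    action = "keep"

print(action, recommended)
```

A tool that only looks for newer releases would find nothing to suggest here, while looking in both directions yields an unaffected older release.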

Feel free to experiment with the adviser parameters:

In [None]:
from thoth.adviser import RecommendationType

# Set recommendation type to one of the following:
[e.name for e in RecommendationType]

In [None]:
# Adjust the version ranges of the Python packages being installed. These versions are resolved
# using pip's internal algorithm, so anything compatible with PEP 440 (and with Pipfile, for
# Pipfile inputs) works out of the box. Note that this resolution is not done by installing
# dependencies (as is the case with pip/Pipenv). Instead, a resolver implemented on top of the
# graph database resolves dependencies much faster, as all the data are pre-computed.

PIPFILE_STR = """
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
flask = "*"
pyyaml = "*"


# Allow pre-releases.
[pipenv]
allow_prereleases = true
"""

# The example above also used Pipfile.lock. Pipfile.lock is actually not used in
# recommendations, but Thoth stores it internally to track users' stacks, their evolution and changes.
project = Project.from_strings(PIPFILE_STR)
project.to_dict()

# Mind the dependencies resolved in the solver run: dependencies unknown to Thoth obviously cannot be resolved by Thoth.

All of the above can be accomplished using the Thamos CLI (the above is a more in-depth description of what Thoth does at a lower level). From the user's perspective, one just installs `Thamos`, adjusts the configuration via `thamos config` (automatic discovery of available hardware is performed) and issues `thamos advise`, which talks to a Thoth deployment. All of the above is transparent to the user, and the report is shown in a well-formatted table. [Follow the README instructions in the thamos repo](https://github.com/thoth-station/thamos/tree/master/examples/scoring/) to experience this on your own.


Happy hacking! ;-)

In [None]:
from thoth.lab import packages_info

# Record Thoth's package versions so that future runs of this Jupyter Notebook are reproducible.
packages_info()