<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Context" data-toc-modified-id="Context-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Context</a></span></li><li><span><a href="#Goal" data-toc-modified-id="Goal-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Goal</a></span></li><li><span><a href="#Content" data-toc-modified-id="Content-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Content</a></span></li><li><span><a href="#How-you-can-use-the-Data" data-toc-modified-id="How-you-can-use-the-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>How you can use the Data</a></span></li><li><span><a href="#Import-packages" data-toc-modified-id="Import-packages-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Import packages</a></span></li><li><span><a href="#Retrieve-the-data" data-toc-modified-id="Retrieve-the-data-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Retrieve the data</a></span></li><li><span><a href="#Explore-one-solver-report" data-toc-modified-id="Explore-one-solver-report-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Explore one solver report</a></span><ul class="toc-item"><li><span><a href="#Solver-report-metadata" data-toc-modified-id="Solver-report-metadata-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Solver report metadata</a></span><ul class="toc-item"><li><span><a href="#Consider-all-solver-reports" data-toc-modified-id="Consider-all-solver-reports-7.1.1"><span class="toc-item-num">7.1.1&nbsp;&nbsp;</span>Consider all solver reports</a></span></li></ul></li><li><span><a href="#Solver-report-result" data-toc-modified-id="Solver-report-result-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Solver report result</a></span></li></ul></li><li><span><a href="#Consider-all-solver-reports" data-toc-modified-id="Consider-all-solver-reports-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Consider all solver reports</a></span><ul class="toc-item"><li><span><a href="#Aggregated-Data-from-solver-reports" 
data-toc-modified-id="Aggregated-Data-from-solver-reports-8.1"><span class="toc-item-num">8.1&nbsp;&nbsp;</span>Aggregated Data from solver reports</a></span></li><li><span><a href="#Error-Data-from-solver-reports" data-toc-modified-id="Error-Data-from-solver-reports-8.2"><span class="toc-item-num">8.2&nbsp;&nbsp;</span>Error Data from solver reports</a></span></li></ul></li></ul></div>

# Context

The Thoth [Solver Dataset](https://github.com/thoth-station/datasets/blob/master/notebooks/thoth-solver-dataset/thoth-solver-dataset-v1.0.zip) is part of a series of datasets of observations about software stacks (e.g. dependency tree, installability, performance, security, health) collected by [Project Thoth](https://thoth-station.ninja/). All of these datasets can also be found [here](https://github.com/thoth-station/datasets), where they are described and explored to facilitate their use. The observations are created by different components of [Project Thoth](https://thoth-station.ninja/) and stored in the Thoth Knowledge Graph, which [Thoth Adviser](https://github.com/thoth-station/adviser) uses to provide advice on software stacks depending on user requirements.

# Goal
The goal is to make datasets widely available and useful for data scientists. The Thoth Team, within the Office of the CTO at Red Hat, has collected datasets from the IT domain that can be made open source and used for training machine learning models.

# Content
The Thoth [Solver Dataset](https://github.com/thoth-station/datasets/blob/master/notebooks/thoth-solver-dataset/thoth-solver-dataset-v1.0.zip) has been created with one of the components of Thoth, the [Dependency Solver](https://github.com/thoth-station/solver), which tries to answer a simple question:
* What packages will be installed (resolved by pip or any Python-compliant dependency resolver) for the provided stack?


# How you can use the Data
You can download and use this data for free for your own purposes. All we ask is three things:

* you cite the Thoth Team as the source if you use the data;
* you accept that you are solely responsible for how you use the data;
* you do not sell this data to anyone; it is free!

## Set environment variables to access the datasets on Ceph

For more detail on the Operate First Ceph public bucket used here, visit https://github.com/operate-first/apps/blob/master/docs/odh/trino/access_public_bucket.md

In [None]:
%env THOTH_CEPH_KEY_ID=LLEzCoxu7pvjzO4inoL8
%env THOTH_CEPH_SECRET_KEY=1HnDVoIS2jt3h3xEpgeQlCX5+FeOUH0wOrvWVvZP
%env THOTH_CEPH_BUCKET_PREFIX=thoth
%env THOTH_S3_ENDPOINT_URL=https://s3-openshift-storage.apps.smaug.na.operate-first.cloud
%env THOTH_CEPH_BUCKET=opf-datacatalog
%env THOTH_DEPLOYMENT_NAME=datasets
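
Before retrieving the data, it can help to confirm that the variables from the cell above are actually visible to the Python process. The following is a minimal sketch using only the standard library; `missing_thoth_env` is a hypothetical helper, and it assumes the report-processing library reads these settings from the process environment.

```python
import os

# Variable names taken from the %env cell above.
REQUIRED_VARS = (
    "THOTH_CEPH_KEY_ID",
    "THOTH_CEPH_SECRET_KEY",
    "THOTH_CEPH_BUCKET_PREFIX",
    "THOTH_S3_ENDPOINT_URL",
    "THOTH_CEPH_BUCKET",
    "THOTH_DEPLOYMENT_NAME",
)


def missing_thoth_env(environ=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_thoth_env()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Thoth environment variables are set.")
```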

# Import packages

In [None]:
from thoth.report_processing.components.solver import Solver
import pandas as pd

# Retrieve the data

In [None]:
solver_reports = Solver.aggregate_solver_results()

# Explore one solver report

Each solver report is created for a specific package (e.g. a Python package from a certain index in a certain version),
*solved* using a certain solver.

**What is a solver?**
In Thoth terminology, a solver such as `solver-fedora-31-py37` is named after:
* the **operating system** used (e.g. Fedora 31)
* the **Python interpreter** installed (e.g. Python 3.7)

on which the specific **Python package** is going to be installed.
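
The naming scheme described above can be unpacked programmatically. This is a sketch only: `parse_solver_name` is a hypothetical helper, and the regular expression assumes every solver name follows the `solver-<os name>-<os version>-py<major><minor>` pattern seen in `solver-fedora-31-py37`.

```python
import re

# Assumed pattern: solver-<os name>-<os version>-py<major><minor>
SOLVER_NAME_RE = re.compile(
    r"^solver-(?P<os_name>[a-z]+)-(?P<os_version>[0-9.]+)-py(?P<python>[0-9]{2,})$"
)


def parse_solver_name(solver_name):
    """Split a solver name into operating system and Python interpreter parts."""
    match = SOLVER_NAME_RE.match(solver_name)
    if match is None:
        raise ValueError(f"Unrecognised solver name: {solver_name!r}")
    digits = match.group("python")
    return {
        "os_name": match.group("os_name"),
        "os_version": match.group("os_version"),
        # First digit is taken as the major version, the rest as the minor version.
        "python_version": f"{digits[0]}.{digits[1:]}",
    }


print(parse_solver_name("solver-fedora-31-py37"))
```

Names that do not match the assumed pattern raise a `ValueError` rather than being silently misparsed.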

In [None]:
solver_report = solver_reports['solver-fedora-31-py37-0870237d.json']

Each solver report consists of two main parts:
* **metadata**, where information about the dependency solver itself is stored (e.g. version running, type of solver);
* **result**, where the inputs and outputs of the solver are collected.
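
A small guard can make the two-part layout explicit when reports are processed in bulk. The sketch below uses a toy report with invented values; `split_solver_report` is a hypothetical helper, not part of the Thoth libraries.

```python
def split_solver_report(report):
    """Return (metadata, result), raising if either top-level part is missing."""
    missing = [key for key in ("metadata", "result") if key not in report]
    if missing:
        raise KeyError(f"Solver report is missing: {', '.join(missing)}")
    return report["metadata"], report["result"]


# Toy report with invented values, only to show the two-part layout.
toy_report = {
    "metadata": {"analyzer": "thoth-solver", "analyzer_version": "0.0.0"},
    "result": {"errors": [], "tree": [], "unparsed": [], "unresolved": []},
}
metadata, result = split_solver_report(toy_report)
print(sorted(metadata), sorted(result))
```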

## Solver report metadata

All the metadata available for each solver report is described below:
* **analyzer**, name of the analyzer;
* **analyzer_version**, analyzer version;
* **arguments**, arguments for the analyzer;
    * **python**, specific inputs regarding the package to be analyzed (i.e. solved, in this case);
    * **dependency-solver**, inputs specific to the dependency solver;
* **datetime**, when the solver report was created;
* **distribution**, operating system specific info;
* **document_id**, unique ID of the solver report, which includes the solver used (e.g. solver-fedora-31-py37);
* **duration**, duration of the solver run for a certain Python package;
* **hostname**, name of the container where the solver was run;
* **os_release**, OS info;
* **python**, Python interpreter info;
* **thoth_deployment_name**, Thoth architecture specific info;
* **timestamp**.
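
Several of these fields (for example **arguments**) are nested, which makes them awkward to inspect as a single table row. One option is to flatten the nested dictionaries into dotted column names first. The sketch below is illustrative: `flatten` is a hypothetical helper, and the keys inside `arguments` in the toy metadata are invented.

```python
def flatten(mapping, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in mapping.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict) and value:
            # Recurse into non-empty nested dictionaries.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Toy metadata; the nested keys and all values are invented for illustration.
toy_metadata = {
    "analyzer": "thoth-solver",
    "arguments": {
        "python": {"index": "https://pypi.org/simple"},
        "dependency-solver": {"verbose": False},
    },
    "duration": 12,
}
print(flatten(toy_metadata))
```

The flattened dictionary can then be passed to `pd.DataFrame([flatten(solver_report["metadata"])])` to get one column per leaf value.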

In [None]:
pd.DataFrame([solver_report["metadata"]])

In [None]:
solver_subset_metadata = Solver.extract_data_from_solver_metadata(solver_report["metadata"])
pd.DataFrame([solver_subset_metadata])

### Consider all solver reports

In [None]:
# Extract the metadata subset from every solver report.
solver_reports_metadata = [
    Solver.extract_data_from_solver_metadata(solver_reports[solver_document]["metadata"])
    for solver_document in solver_reports
]

solver_reports_metadata_df = pd.DataFrame(solver_reports_metadata)

solver_reports_metadata_df.head()
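
Once metadata is available for all reports, a natural first question is how long each solver takes on average. The sketch below works on raw metadata dictionaries and uses only the `document_id` and `duration` fields listed earlier. It assumes the document ID has the form `<solver name>-<suffix>`, as in `solver-fedora-31-py37-0870237d`; the helper names and toy records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def solver_name_from_document_id(document_id):
    """Drop the trailing suffix, assuming the id is '<solver name>-<suffix>'."""
    return document_id.rsplit("-", 1)[0]


def mean_duration_per_solver(metadata_records):
    """Group durations by solver name and average them, skipping missing values."""
    durations = defaultdict(list)
    for record in metadata_records:
        if record.get("duration") is None:
            continue
        solver = solver_name_from_document_id(record["document_id"])
        durations[solver].append(record["duration"])
    return {solver: mean(values) for solver, values in durations.items()}


# Toy records with invented durations.
toy_records = [
    {"document_id": "solver-fedora-31-py37-0870237d", "duration": 10},
    {"document_id": "solver-fedora-31-py37-aaaaaaaa", "duration": 20},
    {"document_id": "solver-rhel-8-py36-bbbbbbbb", "duration": 6},
    {"document_id": "solver-rhel-8-py36-cccccccc", "duration": None},
]
print(mean_duration_per_solver(toy_records))
```

With the real data, the input could be built as `[solver_reports[name]["metadata"] for name in solver_reports]`.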

## Solver report result

All the fields in the result of a solver report are described below:
* **environment**, information about the environment on which the package has been solved;
* **environment_packages**, information about external packages installed in the environment;
* **errors**, if the installation of a package was not successful, information is stored for each package error;
    * **details**,
        * **command**,
        * **message**,
        * **return_code**,
        * **stderr**,
        * **stdout**,
        * **timeout**,
    * **index_url** from where the package was downloaded;
    * **package_name**;
    * **package_version**;
    * **is_provided_package**, flag for storing package;
    * **is_provided_package_version**, flag for storing package;
    * **type**, error type;
* **tree**, all the packages installed in the dependency tree and information about them;
    * **dependencies**
    * **metadata** of the package as taken from `importlib_metadata`;
    * **index_url** from where the package was downloaded;
    * **package_name**;
    * **package_version**;
    * **sha256**;
* **unparsed**, if there are packages in the tree that could not be parsed;
* **unresolved**, if there are packages in the tree that could not be solved;
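
To see which packages a report actually resolved, the **tree** entries can be reduced to their identifying fields. This is a sketch under the assumption that `tree` is a list of dictionaries carrying the `package_name`, `package_version` and `index_url` keys described above; `packages_from_tree` is a hypothetical helper and the toy result is invented.

```python
def packages_from_tree(result):
    """Return sorted unique (package_name, package_version, index_url) tuples."""
    seen = set()
    for entry in result.get("tree") or []:
        seen.add((entry["package_name"], entry["package_version"], entry["index_url"]))
    return sorted(seen)


# Toy result with invented packages; real reports carry more fields per entry.
toy_result = {
    "tree": [
        {"package_name": "b-pkg", "package_version": "2.0.0", "index_url": "https://pypi.org/simple"},
        {"package_name": "a-pkg", "package_version": "1.0.0", "index_url": "https://pypi.org/simple"},
        {"package_name": "a-pkg", "package_version": "1.0.0", "index_url": "https://pypi.org/simple"},
    ],
    "errors": [],
}
for name, version, index_url in packages_from_tree(toy_result):
    print(name, version, index_url)
```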

In [None]:
pd.DataFrame([solver_report["result"]])

In [None]:
pd.DataFrame([solver_report["result"]["environment"]])

# Consider all solver reports

In [None]:
solver_reports_extracted_data = []
solver_errors = []
for solver_document in solver_reports:
    report = solver_reports[solver_document]

    # Start from the metadata subset, then copy every result field alongside it.
    solver_report_extracted_data = Solver.extract_data_from_solver_metadata(report["metadata"])
    for k, v in report["result"].items():
        solver_report_extracted_data[k] = v
        # Collect the errors from all reports in one flat list.
        if k == "errors" and v:
            solver_errors.extend(Solver.extract_errors_from_solver_result(v))

    # Attach the packages extracted from the dependency tree.
    solver_report_extracted_data["packages"] = Solver.extract_tree_from_solver_result(report["result"])
    solver_reports_extracted_data.append(solver_report_extracted_data)

## Aggregated Data from solver reports

In [None]:
# Use a separate name so the metadata-only DataFrame built earlier is not overwritten.
solver_reports_extracted_data_df = pd.DataFrame(solver_reports_extracted_data)

solver_reports_extracted_data_df.head()

## Error Data from solver reports

In [None]:
solver_total_errors_df = pd.DataFrame(solver_errors)

solver_total_errors_df.head()
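
A quick way to summarise the collected errors is to count them by their **type** field. The sketch below assumes each error record is a dictionary that keeps the `type` key described in the result section; `count_errors_by_type` is a hypothetical helper and the type names in the toy data are invented.

```python
from collections import Counter


def count_errors_by_type(errors):
    """Count error records per 'type', using 'unknown' when the key is absent."""
    return Counter(error.get("type", "unknown") for error in errors)


# Toy error records; the type names are invented for illustration.
toy_errors = [
    {"package_name": "a-pkg", "package_version": "1.0.0", "type": "command_error"},
    {"package_name": "b-pkg", "package_version": "2.0.0", "type": "command_error"},
    {"package_name": "c-pkg", "package_version": "0.1.0", "type": "timeout"},
    {"package_name": "d-pkg", "package_version": "3.0.0"},
]
for error_type, count in count_errors_by_type(toy_errors).most_common():
    print(error_type, count)
```

With the real data, `count_errors_by_type(solver_errors)` would give the same kind of summary, provided the extracted records keep the `type` key.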