# Thoth-solver dataset

- Thoth datasets contain observations of software stacks.
- They provide information about the dependency tree, installability, performance, security, etc.
- All of them were created by various parts of Project Thoth and are stored in the Thoth Knowledge Graph.
- This dataset was created by the [Thoth Dependency Solver](https://github.com/thoth-station/solver) and answers the question:

    What packages will be installed for the provided stack?

This dataset can be accessed through:
- [Thoth datasets on GitHub](https://github.com/thoth-station/datasets/tree/master/notebooks/thoth-solver-dataset)
- [Kaggle](https://www.kaggle.com/thothstation/thoth-solver-dataset-v10)


## Goal 

The ultimate goal is to provide useful and easily available datasets for data scientists to train machine learning models.

## How to use the Data

In order to use the provided data:
- cite the Thoth Team as the source if you use the data,
- accept that you are solely responsible for how you use the data, and
- do not sell this data to anyone, it is free!

## Set environment variables to access the datasets on Ceph

For more details on the Operate First Ceph public bucket used here, see https://www.operate-first.cloud/apps/content/odh/trino/access_public_bucket.html

In [None]:
%env THOTH_CEPH_KEY_ID=LLEzCoxu7pvjzO4inoL8
%env THOTH_CEPH_SECRET_KEY=1HnDVoIS2jt3h3xEpgeQlCX5+FeOUH0wOrvWVvZP
%env THOTH_CEPH_BUCKET_PREFIX=thoth
%env THOTH_S3_ENDPOINT_URL=https://s3-openshift-storage.apps.smaug.na.operate-first.cloud
%env THOTH_CEPH_BUCKET=opf-datacatalog
%env THOTH_DEPLOYMENT_NAME=datasets
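
The `%env` magics only work inside a notebook; in a plain Python script the same variables have to be present in the process environment before the solver reports are loaded. The following is a minimal sketch of a pre-flight check, assuming only the variable names from the cell above. The helper `missing_thoth_variables` is hypothetical and not part of `thoth.report_processing`.

```python
import os

# Variable names taken from the %env cell above.
REQUIRED_VARIABLES = (
    "THOTH_CEPH_KEY_ID",
    "THOTH_CEPH_SECRET_KEY",
    "THOTH_CEPH_BUCKET_PREFIX",
    "THOTH_S3_ENDPOINT_URL",
    "THOTH_CEPH_BUCKET",
    "THOTH_DEPLOYMENT_NAME",
)


def missing_thoth_variables(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_thoth_variables()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Thoth environment variables are set.")
```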

## Import packages

In [None]:
from thoth.report_processing.components.solver import Solver
import pandas as pd

## Access the data

In [None]:
solver_reports = Solver.aggregate_solver_results()

## Access one solver report

Each report is created for a specific package and solved using a certain solver.

In this context, a **solver** such as solver-fedora-34-py39 is named after:
- operating system used (e.g. Fedora 34)
- Python interpreter installed (e.g. Python 3.9)

on which the **specified Python package** will be installed.
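
The naming convention above can be turned into a small parser, which is handy for grouping reports by operating system or interpreter. This is only a sketch: the regular expression is an assumption inferred from the solver names and document IDs that appear in this notebook, and `parse_solver_name` is a hypothetical helper, not a Thoth API.

```python
import re

# Assumed pattern: solver-<os name>-<os version>-py<major><minor>, optionally
# followed by a run-specific suffix as in the document IDs of this dataset.
# An optional hyphen after "py" is tolerated.
SOLVER_NAME_RE = re.compile(
    r"^solver-(?P<os_name>[a-z]+)-(?P<os_version>[0-9.]+)-py-?(?P<py>[0-9]{2,3})(?:-|$)"
)


def parse_solver_name(name):
    """Split a solver name or document ID into OS and Python version parts."""
    match = SOLVER_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognized solver name: {name!r}")
    py = match.group("py")
    return {
        "os_name": match.group("os_name"),
        "os_version": match.group("os_version"),
        # "38" -> "3.8", "310" -> "3.10": the first digit is the major version.
        "python_version": f"{py[0]}.{py[1:]}",
    }


print(parse_solver_name("solver-rhel-8-py38-210712140154-9e9eab93c147ecab"))
```

Names that do not follow the assumed pattern raise a `ValueError` instead of being silently misparsed.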

In [None]:
solver_report = solver_reports['solver-rhel-8-py38-210712140154-9e9eab93c147ecab']


Every solver run result consists of:
- **metadata**, which holds information about the dependency solver itself
- **result**, which holds the actual inputs and outputs of the solver

In [None]:
solver_report

## Metadata

Solver report metadata contains the following information:
- **analyzer**, name of the analyzer;
- **analyzer_version**, analyzer version;
- **arguments**, arguments for the analyzer;
    - **python** specific inputs regarding the package to be analyzed (aka solved in this case);
    - **dependency-solver** specific inputs;
- **datetime**, when the solver report has been created;
- **distribution**, operating system specific info;
- **document_id**, unique ID of the solver report which includes the solver used (e.g. solver-fedora-31-py37);
- **duration**, duration of the solver run for a certain Python package;
- **hostname**, name of the container where the solver was run;
- **os_release**, OS info;
- **python**, Python interpreter info;
- **thoth_deployment_name**, Thoth architecture specific info;
- **timestamp**;
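
Because `arguments` is nested, a one-row DataFrame built directly from the metadata keeps whole dictionaries in single cells. One way to make the nested fields individually addressable is to flatten the dictionary into dotted keys first. The sketch below uses a hypothetical, heavily shortened metadata dictionary; the nested key names and all values are placeholders, not taken from a real report. `pandas.json_normalize` performs a similar flattening if you prefer to stay within pandas.

```python
def flatten(mapping, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in mapping.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical, heavily shortened metadata; keys below "arguments" and all
# values are placeholders for illustration only.
example_metadata = {
    "analyzer": "example-analyzer",
    "arguments": {
        "python": {"requirements": "example-package==1.0.0"},
        "dependency-solver": {"verbose": False},
    },
    "duration": 12,
}

flat_metadata = flatten(example_metadata)
print(sorted(flat_metadata))
```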


In [None]:
pd.DataFrame([solver_report["metadata"]])

In [None]:
pd.set_option('display.max_colwidth', None)

In [None]:
pd.DataFrame([solver_report["metadata"]])['arguments']

In [None]:
solver_subset_metadata = Solver.extract_data_from_solver_metadata(solver_report["metadata"])
pd.DataFrame([solver_subset_metadata])

## Access all available solver reports

In [None]:
solver_reports_metadata = []
for solver_document in solver_reports:
    solver_reports_metadata.append(
        Solver.extract_data_from_solver_metadata(solver_reports[solver_document]["metadata"])
    )

solver_reports_metadata_df = pd.DataFrame(solver_reports_metadata)

solver_reports_metadata_df.head()

## Solver report result

The report result contains the following information:
- **environment**, information about the environment in which the package has been solved;
- **environment_packages**, information about external packages installed in the environment;
- **errors**, if the installation of a package was not successful, information is stored for each package error;
    - **details**,
        - command,
        - message,
        - return_code,
        - stderr,
        - stdout,
        - timeout,
    - **index_url** from where the package was downloaded;
    - **package_name**;
    - **package_version**;
    - **is_provided_package**, flag for storing package;
    - **is_provided_package_version**, flag for storing package;
    - **type**, error type;
- **tree**, all the packages installed in the dependency tree and information about them;
    - **dependencies**
    - **metadata** of the package as taken from importlib_metadata;
    - **index_url** from where the package was downloaded;
    - **package_name**;
    - **package_version**;
    - **sha256**;
    - **platform** description (introduced in this version)
    - **packages** called list (introduced in this version)
- **unparsed**, if there are packages in the tree that could not be parsed;
- **unresolved**, if there are packages in the tree that could not be solved;
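
A quick way to get a feel for a single result is to count how many entries each of these sections holds. The sketch below assumes that `tree`, `errors`, `unparsed` and `unresolved` are list-like (or missing); the helper `summarize_result` and the sample data are hypothetical.

```python
def summarize_result(result):
    """Count the entries in the list-like sections of a solver result."""
    sections = ("tree", "errors", "unparsed", "unresolved")
    # Missing or None sections are treated as empty.
    return {section: len(result.get(section) or []) for section in sections}


# Hypothetical result with placeholder values, for illustration only.
example_result = {
    "tree": [
        {"package_name": "example-package", "package_version": "1.0.0", "dependencies": []},
    ],
    "errors": [],
    "unparsed": [],
    "unresolved": ["example-package==2.0.0"],
}

print(summarize_result(example_result))
```

Applied to a real report, the call would be `summarize_result(solver_report["result"])`, provided the assumption about list-like sections holds.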


In [None]:
pd.DataFrame([solver_report["result"]])

In [None]:
pd.DataFrame([solver_report["result"]["environment"]])

Look into the environment packages for a particular solver report:

In [None]:
pd.set_option('display.max_colwidth', None)

In [None]:
env_packs = pd.DataFrame([solver_report["result"]["environment_packages"]])

In [None]:
print(env_packs)

## Consider all solver reports

In [None]:
solver_reports_extracted_data = []
solver_errors = []
for solver_document in solver_reports:
    solver_report_extracted_data = Solver.extract_data_from_solver_metadata(
        solver_reports[solver_document]["metadata"]
    )
    for k, v in solver_reports[solver_document]["result"].items():
        solver_report_extracted_data[k] = v
        if k == "errors" and v:
            errors = Solver.extract_errors_from_solver_result(v)
            for error in errors:
                solver_errors.append(error)
    
    packages = Solver.extract_tree_from_solver_result(solver_reports[solver_document]["result"])
    solver_report_extracted_data["packages"] = packages
    solver_reports_extracted_data.append(solver_report_extracted_data)

In [None]:
solver_report["result"]

In [None]:
pd.set_option('display.max_colwidth', 50)
solver_reports_metadata_df = pd.DataFrame(solver_reports_extracted_data)
solver_reports_metadata_df.head(10)

## Packages under different names in import

To check for packages in the ecosystem that provide modules under a different name than the package name itself, we compare data from:
- 'requirements' 
- 'packages'

In [None]:
solver_reports_metadata_df.loc[212]['requirements']

In [None]:
solver_reports_metadata_df.loc[212]['packages']

## Check all the available solver reports

In [None]:
nonmatching_packages = []
empty_packages = []

for i in range(len(solver_reports_metadata_df)):
    row = solver_reports_metadata_df.loc[i]
    # Package name as requested in the solver input ("name==version")
    package_name_reqs_i = row['requirements'].split('==')[0]

    # Name of the first package in the resolved tree; empty if the tree has no packages
    packages_i = row['packages']
    package_name_i = packages_i[0]['package_name'] if len(packages_i) > 0 else ''

    if package_name_i != package_name_reqs_i:
        if package_name_i != '':
            nonmatching_packages.append([package_name_reqs_i, i, package_name_i])
            print(f'{package_name_reqs_i} != {package_name_i}')
        else:
            empty_packages.append([package_name_reqs_i, i])
            print(f'{package_name_reqs_i}: no packages found')

print(f'Number of packages that provide modules under a different name than the package name itself = {len(nonmatching_packages)}')
print(f'Number of packages that have no packages specified = {len(empty_packages)}')

Main differences: 
- Uppercase versus lowercase
- '-' replaced by '.'
- Empty package name in packages
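
The first two differences are purely a matter of spelling: PEP 503 treats project names that differ only in case or in runs of `-`, `_` and `.` as the same project. A sketch of that normalization, which could be applied to both names before comparing them so that only genuine mismatches remain. The helper names are illustrative and not part of the notebook or of Thoth.

```python
import re


def normalize_name(name):
    """Normalize a Python package name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def names_match(requested, resolved):
    """Compare two package names after normalization."""
    return normalize_name(requested) == normalize_name(resolved)


print(names_match("Zope.Interface", "zope-interface"))
```

In the loop above, replacing the plain `!=` comparison with `not names_match(...)` would be one way to filter out the case and separator differences.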

In [None]:
nonmatching_packages

In [None]:
empty_packages

## Errors data from solver reports

In [None]:
solver_total_errors_df = pd.DataFrame(solver_errors)

solver_total_errors_df.head()
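
To see which kinds of errors dominate, the extracted error records can be counted by their `type` field. This is a sketch under the assumption that the records returned by `Solver.extract_errors_from_solver_result` keep a `type` key; the sample records and their values are placeholders. If the DataFrame has a `type` column, `solver_total_errors_df["type"].value_counts()` gives the same counts in pandas.

```python
from collections import Counter


def count_errors_by_type(errors):
    """Count error records by their ``type`` field."""
    return Counter(error.get("type", "unknown") for error in errors)


# Hypothetical error records; the type values are placeholders.
example_errors = [
    {"package_name": "example-a", "package_version": "1.0.0", "type": "example-type-1"},
    {"package_name": "example-b", "package_version": "2.0.0", "type": "example-type-1"},
    {"package_name": "example-c", "package_version": "0.1.0", "type": "example-type-2"},
]

for error_type, count in count_errors_by_type(example_errors).most_common():
    print(f"{error_type}: {count}")
```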