Use solver dataset provided by Thoth Dependency Solver, so that we can derive context on why dependencies cannot be solved in order to better advise users on why something cannot be used.
The aim of Thoth solver is to answer a simple question - what packages will be installed (resolved by pip or any Python compliant dependency resolver) for the provided stack?
What is a solver? In Thoth language a solver example is solver-fedora-31-py37 which is named after:
- operating system used (e.g Fedora 31)
- Python interpreter installed (e.g. Python 3.7)
on which the specific Python package is going to be installed.
Thoth Solver Dataset is part of a series of datasets related to observations regarding software stacks (e.g. dependency tree, installability, performance, security, health) as part of Project Thoth.
Thoth Solver Dataset is made by solver reports in json format, where each solver report is created for a specific package (e.g Python package from a certain index in a certain version), solved using a certain solver and it is described in the notebook called Thoth Solver Dataset.
If the installation of a package was not succesfull there will be information stored for each package error;
- command,command used to install the package;
- message,error log;
- return_code,
- stderr,
- stdout,
- timeout,
- index_url, from where the package was download;
- package_name;
- package_version;
- type, error type;
These notebooks can be parametrized using papermill. In this way, we can run the template notebooks automatically locally or in Argo workflow.
1. Template notebook to pre-process solver dataset and output clean dataset which is the input for clustering : notebooks/PreprocessSolverErrorData.ipynb
The purpose of this notebook is to preprocess solver data, i.e, extract error data from solver data, prepare data for clustering, clean and tokenize the clustering data and save the clean dataset for ClusterError notebook.
Excuecute via CLI:
papermill PreprocessSolverErrorData.ipynb PreprocessSolverErrorDataOutput.ipynb -p get_fresh_data False
where get_fresh_data
is the parameter which decides if we need to get fresh data from Ceph or use the data from csv file.
2. Template notebook to cluster errors to identify the type of errors that can appear in solver reports : notebooks/ClusterErrors.ipynb
The purpose of this notebook is to cluster solver errors in order to identify the type of errors that can appear in solver reports.
Excuecute via CLI:
papermill ClusterErrors.ipynb ClusterErrorsOutput.ipynb -p preprocessed_filename 'error-clean-data.csv'
where preprocessed_filename
is the parameter which contains the filename of preprocessed clean data.
1. Thoth: https://thoth-station.ninja/
2. Thoth GitHub: https://github.com/thoth-station
3. Solver: https://github.com/thoth-station/solver
4. Thoth Solver dataset on Kaggle: https://www.kaggle.com/thothstation/thoth-solver-dataset-v10
5. Thoth Solver dataset on GitHub: https://github.com/thoth-station/datasets