<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Context" data-toc-modified-id="Context-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Context</a></span></li><li><span><a href="#Goal" data-toc-modified-id="Goal-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Goal</a></span></li><li><span><a href="#Content" data-toc-modified-id="Content-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Content</a></span></li><li><span><a href="#How-you-can-use-the-Data" data-toc-modified-id="How-you-can-use-the-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>How you can use the Data</a></span></li><li><span><a href="#Import-packages" data-toc-modified-id="Import-packages-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Import packages</a></span></li><li><span><a href="#Retrieve-the-data" data-toc-modified-id="Retrieve-the-data-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Retrieve the data</a></span></li><li><span><a href="#Explore-one-inspection-batch-and-inspection-report" data-toc-modified-id="Explore-one-inspection-batch-and-inspection-report-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Explore one inspection batch and inspection report</a></span></li><li><span><a href="#Analysis-of-inspection-batches" data-toc-modified-id="Analysis-of-inspection-batches-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Analysis of inspection batches</a></span><ul class="toc-item"><li><span><a href="#Error-analysis" data-toc-modified-id="Error-analysis-8.1"><span class="toc-item-num">8.1&nbsp;&nbsp;</span>Error analysis</a></span></li><li><span><a href="#Interpolated-Data" data-toc-modified-id="Interpolated-Data-8.2"><span class="toc-item-num">8.2&nbsp;&nbsp;</span>Interpolated Data</a></span><ul class="toc-item"><li><span><a href="#Standard-Deviation-analysis" data-toc-modified-id="Standard-Deviation-analysis-8.2.1"><span class="toc-item-num">8.2.1&nbsp;&nbsp;</span>Standard Deviation analysis</a></span></li></ul></li></ul></li><li><span><a 
href="#Create-reports-from-inspections" data-toc-modified-id="Create-reports-from-inspections-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Create reports from inspections</a></span></li><li><span><a href="#Visualize-performance-analysis-for-software-stacks" data-toc-modified-id="Visualize-performance-analysis-for-software-stacks-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Visualize performance analysis for software stacks</a></span><ul class="toc-item"><li><span><a href="#Create-plots" data-toc-modified-id="Create-plots-10.1"><span class="toc-item-num">10.1&nbsp;&nbsp;</span>Create plots</a></span><ul class="toc-item"><li><span><a href="#Plots-inputs" data-toc-modified-id="Plots-inputs-10.1.1"><span class="toc-item-num">10.1.1&nbsp;&nbsp;</span>Plots inputs</a></span></li><li><span><a href="#TensorFlow" data-toc-modified-id="TensorFlow-10.1.2"><span class="toc-item-num">10.1.2&nbsp;&nbsp;</span>TensorFlow</a></span></li><li><span><a href="#PyTorch" data-toc-modified-id="PyTorch-10.1.3"><span class="toc-item-num">10.1.3&nbsp;&nbsp;</span>PyTorch</a></span></li><li><span><a href="#Tensorflow-vs-Pytorch" data-toc-modified-id="Tensorflow-vs-Pytorch-10.1.4"><span class="toc-item-num">10.1.4&nbsp;&nbsp;</span>Tensorflow vs Pytorch</a></span></li></ul></li></ul></li></ul></div>

# Context

Thoth Performance Dataset is part of a series of datasets of observations about software stacks (e.g. dependency tree, installability, performance, security, health) collected by [Project Thoth](https://thoth-station.ninja/). All of these datasets are available in the [thoth-station/datasets](https://github.com/thoth-station/datasets) repository, where they are described and explored to make them easier to use. The observations are produced by different Thoth components and stored in the Thoth Knowledge Graph, which [Thoth Adviser](https://github.com/thoth-station/adviser) uses to provide advice on software stacks according to user requirements.

# Goal
The goal is to make datasets widely available and useful to data scientists. The Thoth Team, within the Office of the CTO at Red Hat, has collected datasets from the IT domain that can be open sourced and used to train Machine Learning models.

# Content
Thoth Performance Dataset has been created with a Thoth component called [Amun](https://github.com/thoth-station/amun-api). This service acts as an execution engine for Thoth, where applications are built and tested using [Thoth Performance Indicators (PI)](https://github.com/thoth-station/performance). Amun can be scheduled through another Thoth component called [Dependency Monkey](https://github.com/thoth-station/adviser/blob/master/docs/source/dependency_monkey.rst), which aims to automatically verify software stacks and aggregate relevant observations. Thoth Performance Dataset contains performance tests of software stacks for different types of applications (e.g. Machine Learning).


# How you can use the Data
You can download and use this data for free for your own purposes. All we ask is three things:

* you cite Thoth Team as the source if you use the data;
* you accept that you are solely responsible for how you use the data;
* you do not sell this data to anyone, it is free!

# Set environment variables to access the datasets on Ceph

For more detail on the Operate First Ceph public bucket used here, visit https://www.operate-first.cloud/apps/content/odh/trino/access_public_bucket.html

In [None]:
%env THOTH_CEPH_KEY_ID=LLEzCoxu7pvjzO4inoL8
%env THOTH_CEPH_SECRET_KEY=1HnDVoIS2jt3h3xEpgeQlCX5+FeOUH0wOrvWVvZP
%env THOTH_CEPH_BUCKET_PREFIX=thoth
%env THOTH_S3_ENDPOINT_URL=https://s3-openshift-storage.apps.smaug.na.operate-first.cloud
%env THOTH_CEPH_BUCKET=opf-datacatalog
%env THOTH_DEPLOYMENT_NAME=datasets
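
Before running the following cells, it can help to confirm that the variables above are visible to the kernel. The check below is a minimal sketch using only the standard library; the helper name `missing_thoth_variables` is hypothetical and is not part of any Thoth package.

```python
import os

# Variable names taken from the %env cell above.
REQUIRED_VARIABLES = (
    "THOTH_CEPH_KEY_ID",
    "THOTH_CEPH_SECRET_KEY",
    "THOTH_CEPH_BUCKET_PREFIX",
    "THOTH_S3_ENDPOINT_URL",
    "THOTH_CEPH_BUCKET",
    "THOTH_DEPLOYMENT_NAME",
)


def missing_thoth_variables(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_thoth_variables()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Ceph access variables are set.")
```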

# Import packages

In [None]:
from thoth.report_processing.components.inspection import AmunInspections
from thoth.report_processing.components.inspection import AmunInspectionsSummary
from thoth.report_processing.components.inspection import AmunInspectionsStatistics
from thoth.report_processing.components.inspection import AmunInspectionsFailedSummary

inspection = AmunInspections()
inspection_runs_summary = AmunInspectionsSummary()
inspection_statistics = AmunInspectionsStatistics()
inspections_failed_sumary = AmunInspectionsFailedSummary()

import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1500)
pd.options.plotting.backend = "plotly"  # Use plotly for DataFrame.plot instead of the default matplotlib

Every successful inspection run through Amun with an Argo workflow has the following structure:

- `inspection id`
    - **build**
        - *Dockerfile*
        - *log*
        - *specification*
    - **results**
        - **0**
            - *hwinfo*
            - *log*
            - *result*
        - **1**
            - *hwinfo*
            - *log*
            - *result*

where the number of results depends on the `batch_size` selected when running Amun.
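
The layout above can be pictured as a nested Python mapping. The sketch below uses a mock inspection with placeholder values; storing `results` as a list and keeping `batch_size` inside the build specification are assumptions made for illustration, and `describe_inspection` is a hypothetical helper.

```python
# Mock inspection mirroring the layout above; every value is a placeholder.
mock_inspection = {
    "build": {
        "Dockerfile": "...",
        "log": "...",
        "specification": {"batch_size": 2},  # assumed location of batch_size
    },
    "results": [
        {"hwinfo": {}, "log": "...", "result": {}},
        {"hwinfo": {}, "log": "...", "result": {}},
    ],
}


def describe_inspection(inspection):
    """Return the build file names and the number of results of one inspection."""
    build_files = sorted(inspection["build"])
    return build_files, len(inspection["results"])


build_files, n_results = describe_inspection(mock_inspection)
print("build files:", build_files)
print("number of results:", n_results)
```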

In [None]:
inspection_runs = inspection.aggregate_thoth_inspections_results(
    store_files=['specification', 'build_logs', 'job_logs', 'hardware_info', 'results']
)

In [None]:
inspection_run = inspection_runs['inspection-iotf-1-conv1d-2a71492d']

# Explore one inspection batch and inspection report

Each inspection batch is created either by using Amun directly or through Dependency Monkey, which schedules different inspection batches depending on the purpose of the analysis.

The inputs that can be provided to the Amun API are:
    
* **Base Image** (e.g. rhel8, ubi8, thoth-ubi8-python36)
* **RPMs/Debian packages List**
* **Pinned Down Software Stack** (Pipfile/Pipfile.lock)
* **Hardware Requirement** (e.g. CPU only, GPU)
* **Performance Indicator (PI) and parameters**
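
One way to keep these inputs together while planning an inspection is a small container type. The dataclass below is purely illustrative: `InspectionInputs`, its field names, and the `batch` parameter are hypothetical and do not reproduce the Amun API request schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class InspectionInputs:
    """Hypothetical grouping of the inputs listed above (not the Amun API schema)."""

    base_image: str
    native_packages: List[str] = field(default_factory=list)  # RPM/Debian packages
    pipfile: Optional[str] = None  # pinned software stack (Pipfile)
    pipfile_lock: Optional[str] = None  # pinned software stack (Pipfile.lock)
    hardware: str = "CPU only"  # hardware requirement
    performance_indicator: Optional[str] = None
    pi_parameters: Dict[str, Any] = field(default_factory=dict)


inputs = InspectionInputs(
    base_image="ubi8",
    performance_indicator="PiConv2D",
    pi_parameters={"batch": 4},  # illustrative parameter name and value
)
print(asdict(inputs))
```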

In [None]:
inspection_batch_report = inspection_run['results'][0]['result']

Each result contains the following information:
* **start_datetime**, when the inspection started;
* **end_datetime**, when the inspection ended;
* **document_id**, Document ID;
* **identifier**, Inspection identifier;
* **hwinfo**, hardware information where the inspection has been run;
    * **cpu_features**, flags, frequency, and L1, L2, L3 cache sizes [KB];
    * **cpu_info**, CPU info (e.g. brand, vendor_id, family, model);
    * **cpu_type**, flags identifying CPU Type (e.g. 'is_XEON': True);
    * **platform**;
        * **architecture**;
        * **machine**;
        * **node**;
        * **platform**;
        * **release**;
        * **version**;
        * **processor**;
* **os_release**, OS info taken from `"/etc/os-release"`;
* **runtime_environment**, runtime environment info;
    * **cuda_version**, CUDA version;
    * **hardware**, HW info, cpu family and model;
    * **operating_system**, OS name and version;
    * **python_version**;
* **script_sha256**, unique ID of the Performance Indicator used;
* **stdout**;
    * **@parameters**, parameters specific of the PI;
    * **@results**, results after running the PI (rate [GFLOPS] and elapsed time [ms]);
    * **component**, the component or library under test (e.g. tensorflow, pytorch);
    * **name**, name of the PI (e.g. PiConv2D);
    * **{component}_buildinfo**, build info for the specific component (e.g. AICoE TensorFlow);
* **requirements**, e.g. Pipfile;
* **requirements_locked**, e.g. Pipfile.lock;
* **stderr**;
* **exit_code**;
* **usage**, resource usage for a process or its children as given by [`resource.getrusage()`](https://docs.python.org/3.6/library/resource.html#resource.getrusage).
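
As a rough illustration of how these fields might be read, the sketch below summarizes a mock result. The values are invented, and two details are assumptions: the keys `rate` and `elapsed` inside `@results`, and timestamps in ISO 8601 format. `summarize_result` is a hypothetical helper, not a Thoth function.

```python
from datetime import datetime

# Mock result: field names follow the list above, values are invented.
mock_result = {
    "start_datetime": "2020-01-01T10:00:00",
    "end_datetime": "2020-01-01T10:00:30",
    "exit_code": 0,
    "stdout": {
        "name": "PiConv2D",
        "component": "tensorflow",
        "@parameters": {"batch": 4},
        "@results": {"rate": 12.5, "elapsed": 850.0},  # assumed key names
    },
}


def summarize_result(result):
    """Collect the PI name, status, wall-clock duration and PI measurements."""
    started = datetime.fromisoformat(result["start_datetime"])
    ended = datetime.fromisoformat(result["end_datetime"])
    pi_results = result["stdout"]["@results"]
    return {
        "pi": result["stdout"]["name"],
        "component": result["stdout"]["component"],
        "succeeded": result["exit_code"] == 0,
        "duration_s": (ended - started).total_seconds(),
        "rate_gflops": pi_results.get("rate"),
        "elapsed_ms": pi_results.get("elapsed"),
    }


summary = summarize_result(mock_result)
print(summary)
```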

# Analysis of inspection results

In [None]:
processed_inspection_runs, failed_inspection_runs = inspection.process_inspection_runs(
    inspection_runs,
)

In [None]:
for processed_run in processed_inspection_runs:  # separate name avoids shadowing `inspection`
    print(processed_run)

In [None]:
inspections_df = inspection.create_inspections_dataframe(
    processed_inspection_runs=processed_inspection_runs,
)

In [None]:
inspections_df.head()
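
To give an intuition for how nested inspection results can become a flat table, the sketch below applies `pandas.json_normalize` to mock records. This is a generic illustration and not necessarily how `create_inspections_dataframe` is implemented; the `__` separator, the `rate` and `elapsed` keys, and all values are assumptions.

```python
import pandas as pd

# Mock records with invented values; only the nesting mirrors the result layout.
mock_results = [
    {"identifier": "run-a", "exit_code": 0,
     "stdout": {"component": "tensorflow", "@results": {"rate": 12.0, "elapsed": 900.0}}},
    {"identifier": "run-b", "exit_code": 0,
     "stdout": {"component": "tensorflow", "@results": {"rate": 14.0, "elapsed": 800.0}}},
    {"identifier": "run-c", "exit_code": 0,
     "stdout": {"component": "pytorch", "@results": {"rate": 10.0, "elapsed": 1000.0}}},
]

# Flatten nested dictionaries into columns such as "stdout__@results__rate".
flat_df = pd.json_normalize(mock_results, sep="__")

# Average rate per component, computed on the flattened columns.
mean_rate = flat_df.groupby("stdout__component")["stdout__@results__rate"].mean()
print(flat_df.columns.tolist())
print(mean_rate)
```

Flattening with an explicit separator keeps the origin of each column readable, which makes later grouping and filtering by nested fields straightforward.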

# Create reports from inspections

In [None]:
report_results, _ = inspection_runs_summary.produce_summary_report(inspections_df=inspections_df)

## Hardware

In [None]:
report_results["hardware"]['platform'].head()

In [None]:
report_results["hardware"]['processor']

In [None]:
report_results["hardware"]['flags']

In [None]:
report_results["hardware"]['ncpus']

In [None]:
report_results["hardware"]['info']

## Operating System

In [None]:
report_results["base_image"]['base_image']

In [None]:
report_results["base_image"]['number_cpus_run']

## Performance Indicators

In [None]:
report_results["pi"]['pi']

## Software Stack

In [None]:
report_results["software_stack"]['requirements_locked'].head()

In [None]:
python_packages_dataframe, _ = inspection.create_python_package_df(inspections_df=inspections_df)
python_packages_dataframe.head()