# Understanding OpenML API and Data


In [4]:
!pip install openml==0.15.1
!pip install selenium



## What is OpenML about?
1. OpenML focuses on tracking ML experiments: which model was run, on which dataset, and with which hyperparameters.
2. It does not store trained model weights or detailed documentation.
3. It stores only basic metadata, such as the algorithm used and the evaluation metrics.
4. Unlike Hugging Face, it does not cover fairness, legal, or ethical considerations.
5. OpenML is designed for reproducibility (e.g., tracking experiment logs so that an experiment can be rerun).
6. Model information is still available, but it is limited to:
    - The model type (e.g., Decision Tree, Neural Network)
    - Hyperparameters (e.g., learning rate, max depth)
    - Evaluation metrics (e.g., accuracy, RMSE)
    - Code of the model pipeline (if uploaded)

#### DATASET

OpenML mainly hosts datasets that can be used to train models.

Example: A dataset of house prices with features like square footage, location, and number of bedrooms.

In [5]:
import openml
datasets = openml.datasets.list_datasets(output_format="dataframe")

The OpenML website is a single-page application (SPA) built with JavaScript frameworks like React. Fetching a page with requests and parsing it with BeautifulSoup does not capture the dynamically loaded content, whereas Selenium can render JavaScript.

In [6]:
import re 
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def scrape_openml_stats(dataset_id):
    # Initialize Chrome WebDriver
    chrome_options = Options()
    chrome_options.add_argument("--headless=new")  # Using the new headless mode
    chrome_options.add_argument("--window-size=1920,1080")  # Set window size
    chrome_options.add_argument("--start-maximized")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    # Add user agent to appear more like a regular browser
    chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
    driver = webdriver.Chrome(options=chrome_options)
    
    try:
        # URL of the dataset page
        url = f"https://www.openml.org/search?type=data&id={dataset_id}"
        
        # Navigate to the page
        driver.get(url)
        
        # Wait for elements to load (timeout after 10 seconds)
        wait = WebDriverWait(driver, 10)
        
        # Find elements using aria-labels
        stats = {}
        for stat in ['status', 'downloads', 'likes', 'issues']:
            try:
                element = wait.until(
                    EC.presence_of_element_located(
                        (By.CSS_SELECTOR, f'span[aria-label="{stat}"]')
                    )
                )
                stats[stat] = element.text.strip()
            except TimeoutException:
                stats[stat] = "N/A"
        
        def extract_number(stat_string):
            # Keep only the digits; return None when there are none (e.g. "N/A")
            digits = re.sub(r'\D', '', stat_string)
            return int(digits) if digits else None
        
        # Convert the count fields from display text to integers
        for stat in ('downloads', 'likes', 'issues'):
            stats[stat] = extract_number(stats[stat])

        return stats
        
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        
    finally:
        # Always close the browser
        driver.quit()

In [7]:
import time

dataset_id = 2
st = time.time()
dataset = openml.datasets.get_dataset(dataset_id)
print("Time taken to get dataset from openml: ", time.time()-st)

st = time.time()
dataset_stats = scrape_openml_stats(dataset_id)
print("Time taken to scrape: ", time.time()-st)

Time taken to get dataset from openml:  0.009972333908081055
Time taken to scrape:  6.8709588050842285


Crosswalk for openml dataset: https://docs.google.com/spreadsheets/d/1Z1-GS_omE1mo2-WzWrtENslKd8tUkKqmqP1AsziF4uo/edit?gid=1539101257#gid=1539101257

In [8]:
print("ID: ", dataset.id)
print("Name: ", dataset.name)
print("Status: ", dataset_stats['status'] , "or", datasets.loc[datasets['did'] == dataset_id, 'status'].values[0])
print("Format: ", dataset.format)
print("Licence: ", dataset.licence)
print("Citation: ", dataset.citation)
print("Language: ", dataset.language)
print("Date: ", dataset.upload_date)
print("Version: ", dataset.version)
print("Likes: ", dataset_stats['likes'])
print("Downloads: ", dataset_stats['downloads'])
print("Issues: ", dataset_stats['issues'])
print("Contributor: ", dataset.contributor, "or", dataset.creator) # Sometimes this information is only found in the description
print("Keywords: ", dataset.tag)
print("Features: ", dataset.features)
print("Qualities: ", dataset.qualities)
print("Download URL: ", dataset.url)
print("Paper URL: ", dataset.paper_url)
print("Description: ", dataset.description)

ID:  2
Name:  anneal
Status:  verified or active
Format:  ARFF
Licence:  Public
Citation:  https://archive.ics.uci.edu/ml/citation_policy.html
Language:  English
Date:  2014-04-06T23:19:24
Version:  1
Likes:  0
Downloads:  0
Issues:  0
Contributor:  David Sterling and Wray Buntine or ['David Sterling', 'Wray Buntine']
Keywords:  ['Data Science', 'Engineering', 'Manufacturing', 'Materials', 'study_1', 'study_14', 'study_34', 'study_37', 'study_41', 'study_70', 'study_76', 'test', 'uci']
Features:  {0: [0 - family (nominal)], 1: [1 - product-type (nominal)], 2: [2 - steel (nominal)], 3: [3 - carbon (numeric)], 4: [4 - hardness (numeric)], 5: [5 - temper_rolling (nominal)], 6: [6 - condition (nominal)], 7: [7 - formability (nominal)], 8: [8 - strength (numeric)], 9: [9 - non-ageing (nominal)], 10: [10 - surface-finish (nominal)], 11: [11 - surface-quality (nominal)], 12: [12 - enamelability (nominal)], 13: [13 - bc (nominal)], 14: [14 - bf (nominal)], 15: [15 - bt (nominal)], 16: [16 - bw

#### TASK — What Do You Want to Do With the Dataset?

A task in OpenML is a specific ML problem you want to solve using a dataset.

There are different types of tasks, like:
- Classification (e.g., predict if an email is spam or not)
- Regression (e.g., predict house prices)
- Clustering (e.g., group customers into categories)

A task defines:
1. Which dataset to use?
2. What kind of ML problem is it?
3. How will the model be evaluated? (e.g., accuracy, RMSE)

Example: "Use the house prices dataset to predict the selling price, evaluated using RMSE."
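
The three questions above can be read straight off a task object. The following is a minimal sketch, not the library's own API: `summarize_task` is a hypothetical helper, and the attribute names (`dataset_id`, `task_type`, `estimation_procedure`, `evaluation_measure`) are assumed to match the task objects returned by openml-python. A stand-in object with made-up values is used so the sketch runs offline.

```python
from types import SimpleNamespace


def summarize_task(task):
    """Collect the three things a task defines into a plain dict.

    `task` is any object exposing the attributes read below (assumed to
    match openml-python task objects; hypothetical helper).
    """
    return {
        "dataset_id": task.dataset_id,                      # 1. which dataset
        "task_type": str(task.task_type),                   # 2. what kind of ML problem
        "estimation_procedure": task.estimation_procedure,  # 3. how it is evaluated
        "evaluation_measure": getattr(task, "evaluation_measure", None),
    }


# Stand-in task so the sketch runs without network access (values are made up)
example_task = SimpleNamespace(
    dataset_id=1,
    task_type="Supervised Regression",
    estimation_procedure={"type": "crossvalidation"},
    evaluation_measure="root_mean_squared_error",
)

task_summary = summarize_task(example_task)
print(task_summary)
```

With a real connection, the same helper could presumably be given the result of `openml.tasks.get_task(...)` instead of the stand-in.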

In [9]:
openml.tasks.list_tasks(output_format="dataframe")


| index | tid | ttid | did | name | task_type | status | estimation_procedure | evaluation_measures | source_data | target_feature | ... | NumberOfNumericFeatures | NumberOfSymbolicFeatures | number_samples | cost_matrix | source_data_labeled | target_feature_event | target_feature_left | target_feature_right | quality_measure | target_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | TaskType.SUPERVISED_CLASSIFICATION | 2 | anneal | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 2 | class | ... | 6.0 | 33.0 | | | | | | | | |
| 1 | 3 | TaskType.SUPERVISED_CLASSIFICATION | 3 | kr-vs-kp | Supervised Classification | active | 10-fold Crossvalidation | | 3 | class | ... | 0.0 | 37.0 | | | | | | | | |
| 2 | 4 | TaskType.SUPERVISED_CLASSIFICATION | 4 | labor | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 4 | class | ... | 8.0 | 9.0 | | | | | | | | |
| 3 | 5 | TaskType.SUPERVISED_CLASSIFICATION | 5 | arrhythmia | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 5 | class | ... | 206.0 | 74.0 | | | | | | | | |
| 4 | 6 | TaskType.SUPERVISED_CLASSIFICATION | 6 | letter | Supervised Classification | active | 10-fold Crossvalidation | | 6 | class | ... | 16.0 | 1.0 | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 47457 | 362316 | TaskType.SUPERVISED_CLASSIFICATION | 43035 | dgf_96f4164d-956d-4c1c-b161-68724eb0ccdc | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 43035 | classification_diagnostic | ... | 12.0 | 45.0 | | | | | | | | |
| 47458 | 362317 | TaskType.SUPERVISED_REGRESSION | 43034 | dgf_6af37c98-0933-4ae4-8380-5f63212fb52a | Supervised Regression | active | 10-fold Crossvalidation | mean_absolute_error | 43034 | grav | ... | 42.0 | 12.0 | | | | | | | | |
| 47459 | 362318 | TaskType.SUPERVISED_CLASSIFICATION | 43044 | drug-directory | Supervised Classification | active | 10-fold Crossvalidation | predictive_accuracy | 43044 | PRODUCTTYPENAME | ... | 3.0 | 1.0 | | | | | | | | |
| 47460 | 362319 | TaskType.SUPERVISED_REGRESSION | 43070 | MIP-2016-regression | Supervised Regression | active | 10-fold Crossvalidation | mean_absolute_error | 43070 | PAR10 | ... | 145.0 | 1.0 | | | | | | | | |


#### FLOW — The Machine Learning Model & Pipeline

A flow is the algorithm or pipeline you use to train your model.

- This could be Decision Tree, Random Forest, SVM, Neural Network, etc.
- A flow also includes hyperparameters (like the learning rate or number of layers).
- A flow can be thought of as the model blueprint.
    
Example: "Train a Random Forest with 100 trees on the house prices dataset."

In [10]:
openml.flows.list_flows(output_format="dataframe")

| index | id | full_name | name | version | external_version | uploader |
|---|---|---|---|---|---|---|
| 0 | 1 | openml.evaluation.EuclideanDistance(1.0) | openml.evaluation.EuclideanDistance | 1 | | 1 |
| 1 | 2 | openml.evaluation.PolynomialKernel(1.0) | openml.evaluation.PolynomialKernel | 1 | | 1 |
| 2 | 3 | openml.evaluation.RBFKernel(1.0) | openml.evaluation.RBFKernel | 1 | | 1 |
| 3 | 4 | openml.evaluation.area_under_roc_curve(1.0) | openml.evaluation.area_under_roc_curve | 1 | | 1 |
| 4 | 5 | openml.evaluation.average_cost(1.0) | openml.evaluation.average_cost | 1 | | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 22321 | 25152 | torch.nn.ResNet.566d5541cf9d3ff(1) | torch.nn.ResNet.566d5541cf9d3ff | 1 | openml==0.15.1,torch==2.2.2,torch==module.__ve... | 42266 |
| 22322 | 25153 | torch.nn.ResNet.97f79ff966027f6f(1) | torch.nn.ResNet.97f79ff966027f6f | 1 | openml==0.15.1,torch==2.2.2,torch==module.__ve... | 42266 |
| 22323 | 25154 | torch.nn.ResNet.802d04f19d694690(1) | torch.nn.ResNet.802d04f19d694690 | 1 | openml==0.15.1,torch==2.2.2,torch==module.__ve... | 42266 |
| 22324 | 25155 | torch.nn.ResNet.408241f5d79970f5(1) | torch.nn.ResNet.408241f5d79970f5 | 1 | openml==0.15.1,torch==2.4.1,torch==module.__ve... | 42266 |


#### RUN — Actually Training & Testing the Model

A run is the actual execution of a flow on a task.

- It takes a dataset + a flow (model) + a task and trains the model.
- The results (e.g., accuracy, RMSE, confusion matrix) are stored in OpenML.
    
Example: "Train a Random Forest (100 trees) on the house prices dataset and evaluate it, obtaining RMSE = 12,000."

In [11]:
openml.runs.list_runs(size=10000, output_format="dataframe")

| index | run_id | task_id | setup_id | flow_id | uploader | task_type | upload_time | error_message |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 68 | 6 | 61 | 1 | TaskType.LEARNING_CURVE | 2014-04-06 23:30:40 | |
| 2 | 2 | 72 | 16 | 75 | 1 | TaskType.LEARNING_CURVE | 2014-04-06 23:31:13 | |
| 3 | 3 | 95 | 8 | 63 | 1 | TaskType.LEARNING_CURVE | 2014-04-06 23:32:38 | |
| 7 | 7 | 88 | 13 | 70 | 1 | TaskType.LEARNING_CURVE | 2014-04-06 23:36:01 | |
| 8 | 8 | 85 | 2 | 57 | 1 | TaskType.LEARNING_CURVE | 2014-04-06 23:38:24 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11879 | 11879 | 32 | 95 | 60 | 2 | TaskType.SUPERVISED_CLASSIFICATION | 2014-05-22 09:26:15 | |
| 11880 | 11880 | 58 | 112 | 76 | 2 | TaskType.SUPERVISED_CLASSIFICATION | 2014-05-22 09:26:27 | |
| 11881 | 11881 | 53 | 52 | 130 | 2 | TaskType.SUPERVISED_CLASSIFICATION | 2014-05-22 09:26:33 | |
| 11882 | 11882 | 41 | 148 | 70 | 2 | TaskType.SUPERVISED_CLASSIFICATION | 2014-05-22 09:26:57 | |


#### How is everything connected?

1. Dataset: Choose a dataset (e.g., house prices).
2. Task: Define the ML problem (e.g., predict house price).
3. Flow: Pick a model (e.g., Random Forest).
4. Run: Train & evaluate the model (e.g., RMSE = 12,000).
5. Results are stored for reproducibility.
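
The chain above can be sketched as a handful of plain data classes. This is only an illustration of how the four concepts reference each other; the class names, fields, and IDs are hypothetical and are not the openml-python classes. The values reuse the house prices example from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    dataset_id: int
    name: str


@dataclass
class Task:
    task_id: int
    dataset: Dataset          # a task points at exactly one dataset
    task_type: str
    evaluation_measure: str


@dataclass
class Flow:
    flow_id: int
    name: str
    hyperparameters: dict = field(default_factory=dict)


@dataclass
class Run:
    run_id: int
    task: Task                # a run executes one flow on one task
    flow: Flow
    evaluations: dict = field(default_factory=dict)

    def describe(self) -> str:
        return (f"Run {self.run_id}: {self.flow.name} on "
                f"{self.task.dataset.name} ({self.task.task_type})")


# Hypothetical IDs; values follow the house prices example
house_prices = Dataset(dataset_id=1, name="house-prices")
predict_price = Task(task_id=1, dataset=house_prices,
                     task_type="regression", evaluation_measure="rmse")
random_forest = Flow(flow_id=1, name="RandomForest",
                     hyperparameters={"n_trees": 100})
example_run = Run(run_id=1, task=predict_price, flow=random_forest,
                  evaluations={"rmse": 12000.0})

print(example_run.describe())
```

Note that the dataset is reachable from the run only through the task, which mirrors the dependency order Dataset, Task, Flow, Run.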

#### Hugging Face: Extracting Metadata from a Model Perspective

On Hugging Face, you extract metadata from a model card or repository, focusing on:

1. Model details (architecture, task, fine-tuning source)
2. Training information (dataset used, training methodology, hyperparameters)
3. Usage guidance (intended use, risks, biases, ethical/legal aspects)
4. Performance details (evaluation results, benchmark datasets)
5. Environmental impact (CO₂ emissions)

Example: Extracting metadata from a Hugging Face Model Card

- fair4ml:trainedOn → Dataset used to train the model
- fair4ml:evaluatedOn → Dataset used for evaluation
- fair4ml:mlTask → Task (e.g., text classification, image segmentation)
- fair4ml:fineTunedFrom → Pre-trained model used for fine-tuning
- fair4ml:hasEvaluation → Performance metrics
- fair4ml:usageInstructions → How to use the model (code snippets)
- fair4ml:legal → Legal considerations
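
As a rough sketch of this mapping, the function below takes model card metadata that has already been parsed into a dict and copies a few fields onto FAIR4ML-style keys. The helper `card_to_fair4ml` is hypothetical, the choice of which card key feeds which property is an assumption, and the input keys (`datasets`, `pipeline_tag`, `base_model`, `license`, `tags`) are assumed to follow the usual Hugging Face model card metadata names. The example card is made up.

```python
def card_to_fair4ml(card_data):
    """Map parsed model card metadata onto a few FAIR4ML-style keys.

    Missing or empty fields become "NA" (hypothetical helper; the key
    mapping is an assumption, not an official crosswalk).
    """
    def get(key):
        value = card_data.get(key)
        return value if value not in (None, "", []) else "NA"

    return {
        "fair4ml:trainedOn": get("datasets"),
        "fair4ml:mlTask": get("pipeline_tag"),
        "fair4ml:fineTunedFrom": get("base_model"),
        "license": get("license"),
        "keywords": get("tags"),
    }


# Made-up card metadata for illustration
example_card = {
    "datasets": ["imdb"],
    "pipeline_tag": "text-classification",
    "base_model": "distilbert-base-uncased",
    "license": "apache-2.0",
}

card_metadata = card_to_fair4ml(example_card)
print(card_metadata)
```

Properties such as `fair4ml:usageInstructions` typically live in the free-text body of a card rather than its structured metadata, so they are left out of this sketch.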

#### OpenML: Extracting Metadata from an Experiment Perspective

On OpenML, you extract metadata from experiments and datasets, not models.

OpenML focuses on tracking experiments, so metadata revolves around:

1. Datasets (source, attributes, versioning)
2. Algorithms (Flows) (which models were used, configurations)
3. Runs (which dataset was used with which model and what results were achieved)

Example: Extracting metadata from an OpenML run

- fair4ml:trainedOn → Which dataset was used in the run
- fair4ml:evaluatedOn → Dataset used for testing
- fair4ml:mlTask → The machine learning task performed
- fair4ml:hasEvaluation → Performance metrics (e.g., accuracy, F1-score)


Crosswalk for openml dataset: https://docs.google.com/spreadsheets/d/1Z1-GS_omE1mo2-WzWrtENslKd8tUkKqmqP1AsziF4uo/edit?gid=213328835#gid=213328835

In [3]:
# Note: From a machine learning model's perspective, a run is the OpenML entity closest to the model itself,
# so the metadata below is extracted from the run's perspective.
import openml

run_id = 11234
run = openml.runs.get_run(run_id)

In [4]:
# Dataset, Flow and the Task associated with the Run

dataset = openml.datasets.get_dataset(run.dataset_id) 
flow = openml.flows.get_flow(run.flow_id)
task = openml.tasks.get_task(run.task_id)

In [10]:
# Note: The run does not necessarily directly link to the dataset used for evaluation.
evaluatedOn = dataset.name

# -----------------------------------------------------------------------------------------------------

hasEvaluations = run.evaluations

# -----------------------------------------------------------------------------------------------------
# The flow description is used here because a flow describes a model or algorithm.

intendedUse = description = flow.description

# -----------------------------------------------------------------------------------------------------

mlTask = task.task_type

# -----------------------------------------------------------------------------------------------------

modelCategory = run.flow_name

# -----------------------------------------------------------------------------------------------------

sharedBy = author = maintainer = {"name": run.uploader_name, "profile": "https://www.openml.org/u/"+str(run.uploader)}

# -----------------------------------------------------------------------------------------------------
# Note: In OpenML, a single run typically uses ONE dataset that is split into training, validation, and test portions rather than separate datasets.

# 1. A run is associated with a single dataset
# 2. This dataset is then split according to the task's evaluation procedure
# 3. The splits are predefined by OpenML to ensure reproducibility 

testedOn = trainedOn = validatedOn = {"dataset name" : dataset.name, "dataset page": dataset.openml_url, "estimation procedure" : task.estimation_procedure}

# -----------------------------------------------------------------------------------------------------

dateCreated = dateModified = datePublished = flow.upload_date

# -----------------------------------------------------------------------------------------------------

discussionUrl = "https://github.com/orgs/openml/discussions"

# -----------------------------------------------------------------------------------------------------

inLanguage = flow.language

# -----------------------------------------------------------------------------------------------------

keywords = run.tags # flow.tags can also be used

# -----------------------------------------------------------------------------------------------------

version = flow.version

# -----------------------------------------------------------------------------------------------------

name = "Run_"+str(run.id)

# -----------------------------------------------------------------------------------------------------

url = run.openml_url
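
The note in the cell above, that one dataset serves as training and test data through the task's estimation procedure, can be illustrated with a plain k-fold partition. This is a simplified sketch and not how OpenML produces its splits (those are predefined by the platform); `k_fold_indices` is a hypothetical helper without shuffling or stratification. The 270 instances match the `number_of_instances` reported for this run.

```python
def k_fold_indices(n_instances, n_folds):
    """Yield (train_indices, test_indices) for each fold, in order, unshuffled."""
    indices = list(range(n_instances))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_instances // n_folds + (1 if i < n_instances % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_instances=270, n_folds=10))
for fold_number, (train, test) in enumerate(folds):
    print(fold_number, len(train), len(test))
```

Every instance appears in exactly one test fold and in the training portion of all other folds, which is why `trainedOn` and `testedOn` point to the same dataset.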


In [11]:
metadata = {
    "fair4ml:legal": "NA",
    "fair4ml:ethicalSocial": "NA",
    "fair4ml:evaluatedOn": evaluatedOn, 
    "fair4ml:hasEvaluation": hasEvaluations,
    "fair4ml:fineTunedFrom": "NA",
    "fair4ml:hasCO2eEmissions": "NA",
    "fair4ml:intendedUse": intendedUse,
    "fair4ml:mlTask": mlTask,
    "fair4ml:modelCategory": modelCategory,
    "fair4ml:modelRisksBiasLimitations": "NA",
    "fair4ml:sharedBy": sharedBy,
    "fair4ml:testedOn": testedOn,
    "fair4ml:trainedOn": trainedOn,
    "fair4ml:usageInstructions": "NA",
    "fair4ml:codeSampleSnippet": "NA",
    "fair4ml:validatedOn": validatedOn,
    "codeRepository": "NA",
    "distribution": "NA",
    "memoryRequirements": "NA",
    "operatingSystem": "NA",
    "processorRequirements": "NA",
    "releaseNotes": "NA",
    "softwareHelp": "NA",
    "softwareRequirements": "NA",
    "storageRequirements": "NA",
    "codemeta:buildInstructions": "NA",
    "codemeta:developmentStatus": "NA",
    "codemeta:issueTracker": "NA",
    "codemeta:readme": "NA",
    "codemeta:referencePublication": "NA",
    "archivedAt": "NA",
    "author": author,
    "citation": "NA",
    "conditionsOfAccess": "NA",
    "contributor": "NA",
    "copyrightHolder": "NA",
    "dateCreated": dateCreated,
    "dateModified": dateModified,
    "datePublished": datePublished,
    "discussionUrl": discussionUrl,
    "funding": "NA",
    "inLanguage": inLanguage,
    "isAccessibleForFree": "NA",
    "keywords": keywords,
    "license": "NA",
    "maintainer": maintainer,
    "version": version,
    "description": description,
    "identifier": "NA",
    "name": name,
    "url": url,
}

print(metadata)

{'fair4ml:legal': 'NA', 'fair4ml:ethicalSocial': 'NA', 'fair4ml:evaluatedOn': 'heart-statlog', 'fair4ml:hasEvaluation': {'area_under_roc_curve': 0.809167, 'average_cost': 0.0, 'f_measure': 0.814027, 'kappa': 0.622483, 'kb_relative_information_score': 168.186634, 'mean_absolute_error': 0.185185, 'mean_prior_absolute_error': 0.493873, 'number_of_instances': 270.0, 'os_information': '[Sun Microsystems Inc., 1.6.0_20, amd64, Linux, 2.6.35.14-106.fc14.x86_64]', 'precision': 0.814698, 'predictive_accuracy': 0.814815, 'prior_entropy': 0.991207, 'recall': 0.814815, 'relative_absolute_error': 0.374966, 'root_mean_prior_squared_error': 0.496904, 'root_mean_squared_error': 0.430331, 'root_relative_squared_error': 0.866025, 'scimark_benchmark': 921.833614, 'total_cost': 0.0}, 'fair4ml:fineTunedFrom': 'NA', 'fair4ml:hasCO2eEmissions': 'NA', 'fair4ml:intendedUse': "J. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In B. Schoelkopf and C. Burges and A. Smola, e
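
Since most entries in the dictionary are the `"NA"` placeholder, it can help to list the fields that were actually filled and to store the result as JSON. The sketch below uses two hypothetical helpers and a small made-up sample rather than the real `metadata`. `default=str` is a fallback in case a value is not JSON-native, for example an enum like the `TaskType` values in the listings above; `TaskTypeStub` is a stand-in for such a value.

```python
import json
from enum import Enum


class TaskTypeStub(Enum):  # hypothetical stand-in for an enum-valued field
    SUPERVISED_CLASSIFICATION = 1


def to_json(metadata):
    """Serialize metadata, falling back to str() for non-JSON values."""
    return json.dumps(metadata, default=str, indent=2)


def filled_fields(metadata):
    """Return the keys whose value is not the "NA" placeholder."""
    return [key for key, value in metadata.items() if value != "NA"]


# Made-up sample in the same shape as the metadata dictionary
sample_metadata = {
    "fair4ml:mlTask": TaskTypeStub.SUPERVISED_CLASSIFICATION,
    "fair4ml:legal": "NA",
    "name": "Run_11234",
}

print(to_json(sample_metadata))
print(filled_fields(sample_metadata))
```

Applied to the full dictionary, `filled_fields(metadata)` would give a quick view of how much of the crosswalk an OpenML run can populate.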