<a href="https://colab.research.google.com/github/ramnathv/benchmark-suites/blob/master/OpenML%20Tutorial%20Python%20-%20RV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenML in Python 
Joaquin Vanschoren @joavanschoren

## How to...
- Download datasets and tasks in Python and Jupyter
- Use scikit-learn to build classifiers
- Upload the results to the server


# Setup

## Installation
Via source

In [0]:
#! git clone https://github.com/openml/openml-python.
#! cd openml-python
#! python setup.py install

## Authentication
* Create an OpenML account on http://www.openml.org
* Go to your OpenML account, click 'Account Settings', and then 'API authentication'. 
<center><img src="https://github.com/ramnathv/benchmark-suites/blob/master/images/openml_login.png?raw=1" width="800"></center>
* Create a little file to store the API key locally: ~/.openml/config

In [0]:
apikey=FILL_IN_API_KEY
cachedir=FILL_IN_CACHE_DIR

## Alternative authentication

In [0]:
!pip install openml -q
import openml

# assumes you have your api key in ~/.openml/config
# amueller's read/write key that he will throw away later
# openml.config.apikey='610344db6388d9ba34f6db45a3cf71de'
# we also want to use the test server so as not to polute the production system
openml.config.server = "http://www.openml.org/api/v1/xml"

# Data

## List ALL the datasets

In [7]:
from openml import datasets, tasks, runs
import numpy as np
import pandas as pd
import os

datalist = datasets.list_datasets()

data = pd.DataFrame(datalist).T
print("First 10 of %s datasets..." % len(datalist))
data[:10][['did','name','NumberOfInstances','NumberOfFeatures']]

First 10 of 3087 datasets...


Unnamed: 0,did,name,NumberOfInstances,NumberOfFeatures
2,2,anneal,898,39
3,3,kr-vs-kp,3196,37
4,4,labor,57,17
5,5,arrhythmia,452,280
6,6,letter,20000,17
7,7,audiology,226,70
8,8,liver-disorders,345,6
9,9,autos,205,26
10,10,lymph,148,19
11,11,balance-scale,625,5


Subset based on any property

In [9]:
bin_data = data.loc[data['NumberOfClasses'] == 2]
print("First 10 of %s datasets..." % len(bin_data))
bin_data[:10][['did','name', 'NumberOfInstances','NumberOfFeatures']]

First 10 of 877 datasets...


Unnamed: 0,did,name,NumberOfInstances,NumberOfFeatures
3,3,kr-vs-kp,3196,37
4,4,labor,57,17
13,13,breast-cancer,286,10
15,15,breast-w,699,10
24,24,mushroom,8124,23
25,25,colic,368,27
27,27,colic,368,23
29,29,credit-approval,690,16
31,31,credit-g,1000,21
37,37,diabetes,768,9


Subset based on any property

In [10]:
big_data = data.loc[data['NumberOfInstances'] > 60000]
big_data = big_data.sort_values(by='NumberOfInstances', ascending=True)
print("First 10 of %s datasets..." % len(bin_data))
big_data[:10][['did','name', 'NumberOfInstances']]

First 10 of 877 datasets...


Unnamed: 0,did,name,NumberOfInstances
41065,41065,mnist_rotation,62000
1588,1588,w8a,64700
41169,41169,helena,65196
41899,41899,MultilingualDS,65428
4533,4533,KEGGMetabolicReactionNetwork,65554
1591,1591,connect-4,67557
40668,40668,connect-4,67557
40996,40996,Fashion-MNIST,70000
41982,41982,Kuzushiji-MNIST,70000
554,554,mnist_784,70000


Download a specific dataset. This is done based on the dataset ID (called 'did' in the table above).

In [11]:
dataset = openml.datasets.get_dataset(23512)

print("This is dataset '%s', the target feature is called '%s'" % (dataset.name, dataset.default_target_attribute))
print("URL: %s" % dataset.url)
print(dataset.description[:500])

This is dataset 'higgs', the target feature is called 'class'
URL: https://www.openml.org/data/v1/download/2063675/higgs.arff
**Author**: Daniel Whiteson, University of California Irvine  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/HIGGS)  
**Please cite**: Baldi, P., P. Sadowski, and D. Whiteson. Searching for Exotic Particles in High-energy Physics with Deep Learning. Nature Communications 5 (July 2, 2014).  

**Higgs Boson detection data**. The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the a


In [13]:
from IPython.display import display, Markdown, Latex
display(Markdown(dataset.description))

**Author**: Daniel Whiteson, University of California Irvine  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/HIGGS)  
**Please cite**: Baldi, P., P. Sadowski, and D. Whiteson. Searching for Exotic Particles in High-energy Physics with Deep Learning. Nature Communications 5 (July 2, 2014).  

**Higgs Boson detection data**. The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. The last 500,000 examples are used as a test set.

**Note: This is the UCI Higgs dataset, same as version 1, but it fixes the definition of the class attribute, which is categorical, not numeric.**


### Attribute Information
* The first column is the class label (1 for signal, 0 for background)
* 21 low-level features (kinematic properties): lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi, jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag, jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag
* 7 high-level features derived by physicists: m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb. 

For more detailed information about each feature see the original paper.

Relevant Papers:

Baldi, P., P. Sadowski, and D. Whiteson. Searching for Exotic Particles in High-energy Physics with Deep Learning. Nature Communications 5 (July 2, 2014).

In [14]:
from pprint import pprint
pprint(vars(dataset))

{'_dataset': None,
 'citation': None,
 'collection_date': None,
 'contributor': None,
 'creator': None,
 'data_file': '/root/.openml/cache/org/openml/www/datasets/23512/dataset.arff',
 'data_pickle_file': '/root/.openml/cache/org/openml/www/datasets/23512/dataset.pkl.py3',
 'dataset_id': 23512,
 'default_target_attribute': 'class',
 'description': '**Author**: Daniel Whiteson, University of California '
                'Irvine  \n'
                '**Source**: '
                '[UCI](https://archive.ics.uci.edu/ml/datasets/HIGGS)  \n'
                '**Please cite**: Baldi, P., P. Sadowski, and D. Whiteson. '
                'Searching for Exotic Particles in High-energy Physics with '
                'Deep Learning. Nature Communications 5 (July 2, 2014).  \n'
                '\n'
                '**Higgs Boson detection data**. The data has been produced '
                'using Monte Carlo simulations. The first 21 features (columns '
                '2-22) are kinematic propertie

Get the actual data

In [36]:
d = dataset.get_data(target=dataset.default_target_attribute)
d[3]

['sepallength', 'sepalwidth', 'petallength', 'petalwidth']

In [22]:
X, y = dataset.get_data(target=dataset.default_target_attribute)
iris = pd.DataFrame(X, columns=attribute_names)
iris['class'] = y
print(iris[:10])

ValueError: ignored

Train a scikit-learn model on the data

In [37]:
from sklearn import preprocessing, ensemble

dataset = datasets.get_dataset(61)
X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)
clf = ensemble.RandomForestClassifier()
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

You can also ask which features are categorical to do your own encoding

In [44]:
X

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [45]:
X, y, _, categories = dataset.get_data(target=dataset.default_target_attribute)
enc = preprocessing.OneHotEncoder(categories=categories)
X = enc.fit_transform(X)
clf.fit(X, y)

ValueError: ignored

# Tasks

To run benchmarks consistently (also across studies and tools), OpenML offers Tasks, which include specific train-test splits and other information.

## List ALL the tasks

In [0]:
from openml import tasks
task_list = tasks.list_tasks()

In [54]:
tasks_all = pd.DataFrame(task_list).T
print("First 5 of %s tasks:" % len(tasks_all))
print(tasks_all[:5][['tid','did','name','task_type','estimation_procedure']])

First 5 of 26363 tasks:
  tid did        name                  task_type     estimation_procedure
2   2   2      anneal  Supervised Classification  10-fold Crossvalidation
3   3   3    kr-vs-kp  Supervised Classification  10-fold Crossvalidation
4   4   4       labor  Supervised Classification  10-fold Crossvalidation
5   5   5  arrhythmia  Supervised Classification  10-fold Crossvalidation
6   6   6      letter  Supervised Classification  10-fold Crossvalidation


## Download tasks

In [55]:
task = tasks.get_task(10)
print(task)

OpenML Classification Task
Task Type Description: https://www.openml.org/tt/1
Task ID..............: 10
Task URL.............: https://www.openml.org/t/10
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: class
# of Classes.........: 4
Cost Matrix..........: Available


# Runs

Run a scikit-learn classifier on the task (using the right splits)

In [0]:
import openml
openml.runs.run_model_on_task

In [58]:
from openml.runs import run_model_on_task

clf = ensemble.RandomForestClassifier()
run = run_model_on_task(task, clf)
print("RandomForest has run on the task.")

OpenMLServerException: ignored

Upload the run to the OpenML server

In [0]:
import xmltodict

return_code, response = run.publish()

if(return_code == 200):
    response_dict = xmltodict.parse(response)
    run_id = response_dict['oml:upload_run']['oml:run_id']
    print("Uploaded run with id %s" % (run_id))
    print("Check it at www.openml.org/r/%s" % (run_id))

Uploaded run with id 538241
Check it at www.openml.org/r/538241


## TL;DR;
You can easily run and share scikit-learn experiments on OpenML

In [0]:
from sklearn import tree
from openml import tasks,runs
task = tasks.get_task(14951)
clf = tree.DecisionTreeClassifier()
run = runs.run_task(task, clf)
return_code, response = run.publish()

# get the run id for reference
import xmltodict
if(return_code == 200):
    response_dict = xmltodict.parse(response)
    run_id = response_dict['oml:upload_run']['oml:run_id']
    print("Uploaded run with id %s. Check it at www.openml.org/r/%s" % (run_id,run_id))

4074
Uploaded run with id 595118. Check it at www.openml.org/r/595118


## Challenge: Build the 'best' model on the Higgs dataset together
* Check progress on: http://www.openml.org/t/52950

In [0]:
from openml import tasks,runs
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import Imputer
task = tasks.get_task(52950)
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("knn", KNeighborsClassifier())])
run = runs.run_task(task, estimator)
return_code, response = run.publish()

# get the run id for reference
import xmltodict
if(return_code == 200):
    response_dict = xmltodict.parse(response)
    run_id = response_dict['oml:upload_run']['oml:run_id']
    print("Uploaded run with id %s. Check it at www.openml.org/r/%s" % (run_id,run_id))