# OpenML in Python 
OpenML is an online collaboration platform for machine learning: 

* Share and reuse machine learning datasets, algorithms, models, and experiments
* Well-documented, annotated datasets with uniform access
* APIs in Java, R, Python, and more to download and upload everything
* Better reproducibility of experiments and reuse of machine learning models
* Works well with machine learning libraries such as scikit-learn
* Large-scale benchmarking and comparison against the state of the art

In [None]:
# Install OpenML (developer version); the leading '!' runs a shell command from the notebook
!pip install git+https://github.com/renatopp/liac-arff@master
!pip install git+https://github.com/openml/openml-python.git@develop

# Import and set key
import openml as oml
oml.config.apikey = 'YOURKEY'  # replace with your own API key

In [3]:
# YOU CAN SKIP THIS

# General imports and settings
from preamble import *
%matplotlib inline
InteractiveShell.ast_node_interactivity = "all"
HTML('''<style>html, body{overflow: visible !important} .CodeMirror{min-width:105% !important;} .rise-enabled .CodeMirror, .rise-enabled .output_subarea{font-size:140%; line-height:1.2; overflow: visible;} .output_subarea pre{width:110%}</style>''') # For slides

## Authentication

* Create an OpenML account (free) on http://www.openml.org. 
* After logging in, open your account page (avatar on the top right)
* Open 'Account Settings', then 'API authentication' to find your API key.

There are two ways to authenticate:  

* Create a plain text file `~/.openml/config` with the line 'apikey=MYKEY', replacing MYKEY with your API key.
* Run the code below, replacing 'MYKEY' with your API key.

In [None]:
# Run this to authenticate. Don't share your API key!
import os

# Reads the key from the OPENMLKEY environment variable, falling back to 'MYKEY'
oml.config.apikey = os.environ.get('OPENMLKEY', 'MYKEY')
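
The config-file option can be sketched as follows. This is a minimal illustration that assumes only the `key=value` line format described above: it writes to a temporary directory rather than the real `~/.openml/config`, and `read_config` is a hypothetical helper, not part of the openml package.

```python
import os
import tempfile

def read_config(path):
    """Parse simple key=value lines (hypothetical helper, not an openml API)."""
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            key, _, value = line.partition('=')
            settings[key.strip()] = value.strip()
    return settings

# Write a throwaway config file in a temporary directory
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, 'config')
with open(config_path, 'w') as f:
    f.write('apikey=MYKEY\n')

settings = read_config(config_path)
print(settings['apikey'])
```

Keeping the key in a file or an environment variable, rather than in the notebook itself, makes it harder to share the key by accident.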


## List tasks

In [17]:
import time

start = time.time()

task_list = oml.tasks.list_tasks() # Returns a dict of all tasks on the server

end = time.time()
print(end - start)

len(task_list)


36.36972689628601


45985

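In [50]:
import time

# Point the client at a local OpenML server and time the same call
oml.config.server = 'http://localhost/OpenML/api/v1'
oml.config.apikey = 'MYKEY'  # replace with the API key for the local server

start = time.time()

task_list = oml.tasks.list_tasks() # Returns a dict of all tasks on this server

end = time.time()
print(end - start)

# Print every task whose dataset ID ('did') is 45
for k, v in task_list.items():
    if v['did'] == 45:
        print(v)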

len(task_list)


2.5276999473571777


4389
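
Both timing cells repeat the same start/stop pattern. A small context manager can wrap it. The sketch below uses only the standard library; `timed` is a hypothetical name, and the `sum(...)` call merely stands in for `oml.tasks.list_tasks()` so that the example runs offline.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed('stand-in workload', timings):
    total = sum(range(100000))  # placeholder for oml.tasks.list_tasks()

print("%s: %.4f s" % ('stand-in workload', timings['stand-in workload']))
```

Collecting the timings in a dict makes it easy to compare several servers side by side.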

## List datasets

In [10]:
import openml as oml
import pandas as pd

datalist = oml.datasets.list_datasets() # Returns a dict

datalist = pd.DataFrame.from_dict(datalist, orient='index') # Create a DataFrame
print("First 10 of %s datasets..." % len(datalist))
datalist[:10][['did','name','NumberOfInstances',
               'NumberOfFeatures','NumberOfClasses']]

First 10 of 19528 datasets...


| | did | name | NumberOfInstances | NumberOfFeatures | NumberOfClasses |
|---|---|---|---|---|---|
| 1 | 1 | anneal | 898.0 | 39.0 | 6.0 |
| 2 | 2 | anneal | 898.0 | 39.0 | 6.0 |
| 3 | 3 | kr-vs-kp | 3196.0 | 37.0 | 2.0 |
| 4 | 4 | labor | 57.0 | 17.0 | 2.0 |
| 5 | 5 | arrhythmia | 452.0 | 280.0 | 16.0 |
| 6 | 6 | letter | 20000.0 | 17.0 | 26.0 |
| 7 | 7 | audiology | 226.0 | 70.0 | 24.0 |
| 8 | 8 | liver-disorders | 345.0 | 7.0 | -1.0 |
| 9 | 9 | autos | 205.0 | 26.0 | 7.0 |
| 10 | 10 | lymph | 148.0 | 19.0 | 4.0 |
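
Because `list_datasets()` returns a dict, the listing can also be filtered in plain Python before building a DataFrame. The sketch below hard-codes four rows copied from the output above in place of a live server call, and assumes the dict is keyed by dataset ID, as the index in that output suggests; `binary_datasets` is a hypothetical helper.

```python
# Stand-in for oml.datasets.list_datasets(): four rows copied from the output above
datalist = {
    3: {'did': 3, 'name': 'kr-vs-kp', 'NumberOfInstances': 3196.0,
        'NumberOfFeatures': 37.0, 'NumberOfClasses': 2.0},
    4: {'did': 4, 'name': 'labor', 'NumberOfInstances': 57.0,
        'NumberOfFeatures': 17.0, 'NumberOfClasses': 2.0},
    6: {'did': 6, 'name': 'letter', 'NumberOfInstances': 20000.0,
        'NumberOfFeatures': 17.0, 'NumberOfClasses': 26.0},
    8: {'did': 8, 'name': 'liver-disorders', 'NumberOfInstances': 345.0,
        'NumberOfFeatures': 7.0, 'NumberOfClasses': -1.0},
}

def binary_datasets(datasets, min_instances=0):
    """Return sorted names of two-class datasets with at least min_instances rows."""
    return sorted(
        d['name'] for d in datasets.values()
        if d.get('NumberOfClasses') == 2
        and d.get('NumberOfInstances', 0) >= min_instances
    )

print(binary_datasets(datalist))
print(binary_datasets(datalist, min_instances=1000))
```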


## Download datasets

In [3]:
dataset = oml.datasets.get_dataset(1471)

print("This is dataset '%s', the target feature is '%s'" % 
      (dataset.name, dataset.default_target_attribute))
print("URL: %s" % dataset.url)
print(dataset.description[:500])

This is dataset 'eeg-eye-state', the target feature is 'Class'
URL: https://www.openml.org/data/download/1587924/eeg-eye-state.ARFF
**Author**: Oliver Roesler, it12148'@'lehre.dhbw-stuttgart.de  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State), Baden-Wuerttemberg, Cooperative State University (DHBW), Stuttgart, Germany  
**Please cite**:   

All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video fr


## Train models
Train a scikit-learn model manually on the downloaded data:

In [4]:
from sklearn import neighbors

dataset = oml.datasets.get_dataset(1471)
X, y = dataset.get_data(target=dataset.default_target_attribute)
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

## Train models on tasks

In [5]:
task = oml.tasks.get_task(14951)
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
run = oml.runs.run_task(task, clf)
run.model

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

Share the run on the OpenML server:

In [6]:
myrun = run.publish()
print("Uploaded to http://www.openml.org/r/" + str(myrun.run_id))

Uploaded to http://www.openml.org/r/2347815


Running and sharing also works with scikit-learn pipelines:

In [7]:
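from sklearn import pipeline, ensemble, preprocessing
task = oml.tasks.get_task(59)
# Chain preprocessing and a classifier so the whole pipeline is run and shared as one model
pipe = pipeline.Pipeline(steps=[
            # Replace missing values with the median of each feature
            ('Imputer', preprocessing.Imputer(strategy='median')),
            # One-hot encode the features; categories unseen during fit are ignored
            ('OneHotEncoder', preprocessing.OneHotEncoder(sparse=False, handle_unknown='ignore')),
            ('Classifier', ensemble.RandomForestClassifier())
           ])
run = oml.runs.run_task(task, pipe)
myrun = run.publish()
print("Uploaded to http://www.openml.org/r/" + str(myrun.run_id))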

Uploaded to http://www.openml.org/r/2347816


Easy benchmarking:

In [8]:
for task_id in [14951,10103,9945]:
    task = oml.tasks.get_task(task_id)
    data = oml.datasets.get_dataset(task.dataset_id)
    clf = neighbors.KNeighborsClassifier(n_neighbors=5)
    run = oml.runs.run_task(task, clf)
    myrun = run.publish()
    print("kNN on %s: http://www.openml.org/r/%d" % (data.name, myrun.run_id))

kNN on eeg-eye-state: http://www.openml.org/r/2347817
kNN on volcanoes-a1: http://www.openml.org/r/2347818
kNN on walking-activity: http://www.openml.org/r/2347819
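
When a benchmark covers many tasks, one failed download or upload should not abort the whole loop. A minimal sketch of that pattern follows. `benchmark` and `fake_run` are hypothetical names, and `fake_run` stands in for the `get_task` / `run_task` / `publish` sequence above so that the example runs without a server or API key. The task and run IDs are the ones shown in the output above, except that task 10103 is made to fail here purely for illustration.

```python
def benchmark(task_ids, run_one):
    """Call run_one(task_id) per task; collect results and errors separately."""
    results, failures = {}, {}
    for task_id in task_ids:
        try:
            results[task_id] = run_one(task_id)
        except Exception as exc:  # keep going if a single task fails
            failures[task_id] = str(exc)
    return results, failures

def fake_run(task_id):
    """Offline stand-in for get_task + run_task + publish; returns a run ID."""
    known_runs = {14951: 2347817, 9945: 2347819}
    if task_id not in known_runs:
        raise ValueError("no stand-in run for task %d" % task_id)
    return known_runs[task_id]

results, failures = benchmark([14951, 10103, 9945], fake_run)
for task_id, run_id in results.items():
    print("Task %d: http://www.openml.org/r/%d" % (task_id, run_id))
print("Failed tasks:", sorted(failures))
```

In a real notebook, `run_one` would wrap the three OpenML calls from the loop above; passing it in as a function keeps the error handling separate from the experiment itself.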
