# OpenML Dataset Creation and Properties

In this notebook, we look at how an OpenML dataset is created and assess OpenML's potential 
as a benchmarking framework for sktime.

TL;DR:
- In an OpenML dataset, each variable is a column and each time point is a row, just like in a DataFrame.
- A `target` column is not required, but since OpenML does not support `pd.Series` objects nested inside DataFrame cells, 
time series need to be converted into a rolling-window format.
- Because the dataset has to be in rolling-window format, adapters are needed for the .ts format, estimators, CV splitters and `metrics`.
- It is still unclear how to convert a MultiIndex DataFrame 1:1 into an OpenML dataset.
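
To make "rolling-window format" concrete, here is a minimal sketch of one way to tabulate a univariate series into lag columns plus a target column. The helper name `to_rolling_window` and the column names `lag_k` and `target` are assumptions made for illustration; they are not part of OpenML or sktime.

```python
import pandas as pd


def to_rolling_window(series: pd.Series, window_length: int) -> pd.DataFrame:
    """Tabulate a series into `window_length` lag columns plus a target column.

    Hypothetical helper for illustration only.
    """
    columns = {
        f"lag_{window_length - i}": series.shift(window_length - i)
        for i in range(window_length)
    }
    columns["target"] = series
    # The first `window_length` rows have incomplete windows and are dropped.
    return pd.DataFrame(columns).dropna().reset_index(drop=True)


y = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
table = to_rolling_window(y, window_length=3)
print(table)
```

Each row of the resulting table is a plain record of scalars, which is the shape a flat tabular format can store.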
___

## 1. Create a simple dataset

In [45]:
import numpy as np
import openml
import pandas as pd


# Placeholder: use your own OpenML API key and never commit a real key
openml.config.apikey = "YOUR_API_KEY"

An OpenML dataset follows the template below. The values are only an example.

In [60]:
# Raw data input as list
data = [
    ["sunny", 85, 85, "FALSE", "no"],
    ["sunny", 80, 90, "TRUE", "no"],
    ["overcast", 83, 86, "FALSE", "yes"],
    ["rainy", 70, 96, "FALSE", "yes"],
    ["rainy", 68, 80, "FALSE", "yes"],
    ["rainy", 65, 70, "TRUE", "no"],
    ["overcast", 64, 65, "TRUE", "yes"],
    ["sunny", 72, 95, "FALSE", "no"],
    ["sunny", 69, 70, "FALSE", "yes"],
    ["rainy", 75, 80, "FALSE", "yes"],
    ["sunny", 75, 70, "TRUE", "yes"],
    ["overcast", 72, 90, "TRUE", "yes"],
    ["overcast", 81, 75, "FALSE", "yes"],
    ["rainy", 71, 91, "TRUE", "no"],
]

# Each tuple describes one variable (column); for a categorical variable,
# the second element lists all expected categories
attribute_names = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "REAL"),
    ("humidity", "REAL"),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]

description = (
    "The weather problem is a tiny dataset that we will use repeatedly"
    " to illustrate machine learning methods. Entirely fictitious, it "
    "supposedly concerns the conditions that are suitable for playing "
    "some unspecified game. In general, instances in a dataset are "
    "characterized by the values of features, or attributes, that measure "
    "different aspects of the instance. In this case there are four "
    "attributes: outlook, temperature, humidity, and windy. "
    "The outcome is whether to play or not."
)

citation = (
    "I. H. Witten, E. Frank, M. A. Hall, and ITPro,"
    "Data mining practical machine learning tools and techniques, "
    "third edition. Burlington, Mass.: Morgan Kaufmann Publishers, 2011"
)

Convert the data to a DataFrame for easier inspection.

In [61]:
import pandas as pd

df = pd.DataFrame(data, columns=[col_name for col_name, _ in attribute_names])
# Enforce a categorical dtype on the categorical columns
df["outlook"] = df["outlook"].astype("category")
# Compare against "TRUE": astype("bool") would map every non-empty string,
# including "FALSE", to True
df["windy"] = df["windy"] == "TRUE"
df["play"] = df["play"].astype("category")
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   outlook      14 non-null     category
 1   temperature  14 non-null     int64   
 2   humidity     14 non-null     int64   
 3   windy        14 non-null     bool    
 4   play         14 non-null     category
dtypes: bool(1), category(2), int64(2)
memory usage: 650.0 bytes
None


In [62]:
df

Unnamed: 0,outlook,temperature,humidity,windy,play
0,sunny,85,85,False,no
1,sunny,80,90,True,no
2,overcast,83,86,False,yes
3,rainy,70,96,False,yes
4,rainy,68,80,False,yes
5,rainy,65,70,True,no
6,overcast,64,65,True,yes
7,sunny,72,95,False,no
8,sunny,69,70,False,yes
9,rainy,75,80,False,yes


Finally, create and upload the dataset

In [63]:
weather_dataset = openml.datasets.create_dataset(
    name="Weather",
    description=description,
    creator="I. H. Witten, E. Frank, M. A. Hall, and ITPro",
    contributor=None,
    collection_date="01-01-2011",
    language="English",
    licence=None,
    default_target_attribute="play",
    row_id_attribute=None,
    ignore_attribute=None,
    citation=citation,
    attributes="auto",
    data=df,
    version_label="example",
)

weather_dataset.publish()

OpenML Dataset
Name.........: Weather
Version......: None
Format.......: arff
Licence......: None
Download URL.: None
OpenML URL...: https://www.openml.org/d/43961
# of features: None

The URL above can be shared, and other people can download the dataset from it.

Observations:
- Data needs to be in either array or DataFrame format. `pd.Series` objects nested inside a DataFrame and MultiIndex DataFrames are not supported.
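
One possible workaround for the MultiIndex limitation is to move the index levels into ordinary columns before upload and rebuild the index after download. The sketch below only shows the pandas side of that round trip. The level names `instance` and `time` are assumptions, and whether the result survives an OpenML upload and download unchanged has not been verified here.

```python
import pandas as pd

# A small hierarchical frame with hypothetical level names.
index = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1, 2]], names=["instance", "time"]
)
hier = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)

# Flatten: every index level becomes a regular column.
flat = hier.reset_index()

# Rebuild the MultiIndex from those columns.
restored = flat.set_index(["instance", "time"])

print(flat)
```

This is not a true 1:1 conversion, because the flat table itself does not record which columns were index levels; that information has to be carried separately.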
___

## 2. Investigate a Time Series Dataset

As an example, we will download the Reliance-Industries-(RIL)-Share-Price-(1996-2020) dataset.

In [74]:
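dataset = openml.datasets.get_dataset(43816)
# get_data returns a tuple (X, y, categorical_indicator, attribute_names),
# so `X` below holds that whole tuple; the DataFrame is its first element.
X = dataset.get_data(dataset_format="dataframe")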



Check the properties of this dataset

In [75]:
# Print a summary
print(
    f"This is dataset '{dataset.name}', the target feature is "
    f"'{dataset.default_target_attribute}'"
)
print(f"URL: {dataset.url}")
print(dataset.description[:500])

This is dataset 'Reliance-Industries-(RIL)-Share-Price-(1996-2020)', the target feature is 'None'
URL: https://old.openml.org/data/v1/download/22102641/Reliance-Industries-(RIL)-Share-Price-(1996-2020).arff
Content
We have daily stock prices of Reliance Industries (RIL) the parent of telecom company Jio Platform for which we have multiple investments in India including from Facebook and Intel
This is owned by Mukesh Ambani, the richest person in India. Also, among the top 10 richest person in the world! 
Data Dictionary
Date    ==== Date of information
Symbol    ==== Name of share. Reliance Industries in this case
Series    ==== Equities i.e., stock price
Prev Close ==== Price stock closed in the l


In [76]:
X

(            Date    Symbol Series  Prev_Close     Open     High      Low  \
 0     01-01-1996  RELIANCE     EQ      204.65   205.00   206.10   203.65   
 1     02-01-1996  RELIANCE     EQ      205.75   205.25   206.25   202.65   
 2     03-01-1996  RELIANCE     EQ      204.15   207.50   216.95   205.25   
 3     04-01-1996  RELIANCE     EQ      205.70   203.75   204.40   201.05   
 4     05-01-1996  RELIANCE     EQ      203.80   203.00   203.00   200.65   
 ...          ...       ...    ...         ...      ...      ...      ...   
 6200  23-11-2020  RELIANCE     EQ     1899.50  1951.00  1970.00  1926.25   
 6201  24-11-2020  RELIANCE     EQ     1950.70  1964.00  1974.00  1932.00   
 6202  25-11-2020  RELIANCE     EQ     1964.05  1980.00  1992.95  1942.20   
 6203  26-11-2020  RELIANCE     EQ     1947.80  1953.05  1965.00  1930.05   
 6204  27-11-2020  RELIANCE     EQ     1952.60  1940.50  1956.10  1921.40   
 
          Last    Close     VWAP    Volume      Turnover    Trades  \
 0  
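
The `Date` column in the output above holds day-month-year strings, so the table is not yet time-indexed. As a minimal sketch, the rows below (copied from the first three rows of that output, with only the `Open` column kept) show how such a column could be parsed and used as the index. The variable names `sample` and `series` are assumptions for illustration.

```python
import pandas as pd

# First three rows of the output above, reduced to two columns.
sample = pd.DataFrame(
    {
        "Date": ["01-01-1996", "02-01-1996", "03-01-1996"],
        "Open": [205.00, 205.25, 207.50],
    }
)

# The strings are day-month-year, so the format is given explicitly.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d-%m-%Y")
series = sample.set_index("Date")["Open"]
print(series)
```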

Observations:
1. MultiIndex DataFrames are not supported. This will be a problem for hierarchical data.
2. Because `pd.Series` objects nested inside a DataFrame are not supported, sktime's estimators, metrics and `evaluate` cannot consume the data directly.
3. The current workaround is to turn sktime's datasets into a supervised, rolling-window format, which could work with adapters for estimators and metrics.
4. It is not yet clear how to implement the CV split from `evaluate` for such an adapter.
5. Even if 3. and 4. are resolved, the problem in 1. remains.
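
For observation 4, one option worth exploring is to split the rolling-window table by row position, since its rows are ordered in time. The sketch below is a hypothetical expanding-window splitter over row indices. It is not sktime's splitter API and it does not address how `evaluate` would call it.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_rows: int, initial_train_size: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) row indices; every train row precedes every test row."""
    train_end = initial_train_size
    while train_end + test_size <= n_rows:
        train = list(range(train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test
        # Grow the training window by the rows that were just tested.
        train_end += test_size


splits = list(expanding_window_splits(n_rows=10, initial_train_size=4, test_size=2))
for train, test in splits:
    print(f"train rows 0..{train[-1]}, test rows {test}")
```

Because each row of a rolling-window table already carries its own lagged inputs, test rows can be scored independently once the estimator has been fitted on the training rows.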