In [1]:
import copy

import pandas as pd

from hypex.dataset.dataset import Dataset, ExperimentData
from hypex.dataset.roles import FeatureRole, InfoRole, TargetRole

# Dataset and ExperimentData tutorial

In this tutorial, we will look at the Dataset and ExperimentData classes. These are the key classes for working with data in HypEx. Their purpose is to store the data using one of the available backends (currently only the pandas DataFrame is available) and to provide a universal interface to it, so that you can access the data and perform the basic operations needed to prepare it for experiments or analyses.

# Table of contents:

<ul>
<li><a href="#create-dataset">Create Dataset</a></li>
<li><a href="#dataset-methods">Dataset Methods</a></li>
<li><a href="#create-experimentdata">Create ExperimentData</a></li>
</ul>

## Create Dataset
Initializes a new instance of the Dataset class from the data in one of the supported backends.

Args:
* __roles__: A dictionary mapping roles to their corresponding column names and types. Roles are used to mark up data by their intended purpose. There are different types of roles that have different meanings in different contexts.
* __data__: The data to be used for the dataset. Can be either a pandas DataFrame or a file path. Defaults to None.
* __backend__: The backend to be used for the dataset. Defaults to None, which is `pandas`.


In [2]:
ds = Dataset({'a': TargetRole(), 'b': TargetRole(float)})
ds

Empty DataFrame
Columns: []
Index: []

In [3]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

ds = Dataset({'a': TargetRole(), 'b': TargetRole(float)}, data=df)
ds

   a    b
0  1  4.0
1  2  5.0
2  3  6.0

In [4]:
ds.roles

{'a': Target(<class 'int'>), 'b': Target(<class 'float'>)}

#### Create empty
Create an empty Dataset with the same arguments as the Dataset constructor, but without any data. Additionally, you can pass `index` to create an empty Dataset with a predefined index and size.

In [5]:
ds_empty = Dataset.create_empty()
ds_empty

Empty DataFrame
Columns: []
Index: []

In [6]:
ds_empty = Dataset.create_empty(roles={'a': TargetRole(), 'b': TargetRole(float)}, index=range(7))
ds_empty

     a    b
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
6  NaN  NaN

### Backend
A backend in HypEx is the class that implements data storage, navigation, transformation and calculation for the Dataset on top of the underlying framework (currently pandas). You can access it via the `Dataset.backend` property.

In [7]:
type(ds.backend)

hypex.dataset.backends.pandas_backend.PandasDataset

In [8]:
ds.backend

   a    b
0  1  4.0
1  2  5.0
2  3  6.0

To access the data of the backend object, use the `Dataset.data` property.

In [9]:
type(ds.data)

pandas.core.frame.DataFrame

In [10]:
ds.data

   a    b
0  1  4.0
1  2  5.0
2  3  6.0


## Dataset Methods

In the current version of HypEx, the available functions are based on the ones commonly used in Pandas, so most of the functions work the same way. Here we will focus on those features that are significantly different from Pandas.

### From dict
This static method creates a Dataset object from a dict. It accepts two input layouts.

First way, a dict that maps column names to lists of values:

In [11]:
ds_from_dict = Dataset.from_dict({'a': [1, 2], 'b': [3, 4]}, {'a': TargetRole(), 'b': InfoRole()})
ds_from_dict

   a  b
0  1  3
1  2  4

Second way, a list of dicts with one dict per row:

In [12]:
ds_from_dict = Dataset.from_dict([{'a': 1, 'b': 3}, {'a': 2, 'b': 4}], {'a': TargetRole(), 'b': InfoRole()})
ds_from_dict

   a  b
0  1  3
1  2  4
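
Both layouts describe the same table: one is column-oriented, the other is record-oriented. The plain-Python sketch below is not HypEx code, and the helper name `records_to_columns` is made up for this illustration. It only shows that one layout can be converted into the other without losing information.

```python
def records_to_columns(records):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {}
    for row in records:
        for name, value in row.items():
            # Create the column on first sight, then append the row's value.
            columns.setdefault(name, []).append(value)
    return columns


column_layout = {'a': [1, 2], 'b': [3, 4]}
record_layout = [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]

converted = records_to_columns(record_layout)
print(converted == column_layout)
```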

### Search Columns
This method allows you to search columns in a Dataset object by their roles and data types.

In [13]:
columns_found = ds.search_columns(TargetRole(), search_types=[int])
columns_found

['a']

In [14]:
ds[columns_found]

   a
0  1
1  2
2  3
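
As a rough mental model, a role search is a filter over the mapping from column name to role. The sketch below uses a hypothetical `Role` dataclass and a hypothetical `search_columns` function written for this illustration; it is not the HypEx implementation. It also assumes that omitting `search_types` means "do not filter by type".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Role:
    """Hypothetical stand-in for a role: a kind plus an optional data type."""
    kind: str                       # e.g. "target", "feature", "info"
    data_type: Optional[type] = None


def search_columns(roles, kind, search_types=None):
    """Return the columns whose role kind matches and, when search_types
    is given, whose data type is one of the listed types."""
    return [
        column
        for column, role in roles.items()
        if role.kind == kind
        and (search_types is None or role.data_type in search_types)
    ]


roles = {'a': Role('target', int), 'b': Role('target', float)}
int_targets = search_columns(roles, 'target', search_types=[int])
all_targets = search_columns(roles, 'target')
print(int_targets)
print(all_targets)
```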

### Replace roles

This method assigns new roles to specific columns, or replaces an old role with a new one for every column that currently has the old role.

In [15]:
ds.roles

{'a': Target(<class 'int'>), 'b': Target(<class 'float'>)}

In [16]:
ds.replace_roles({"a": FeatureRole(int), "b": InfoRole()})
ds.roles

{'a': Feature(<class 'int'>), 'b': Info(None)}

In [17]:
ds.replace_roles({FeatureRole(): TargetRole()})
ds.roles

{'a': Target(None), 'b': Info(None)}

In [18]:
ds.replace_roles({"a": TargetRole(int), "b": TargetRole(float)})
ds.roles

{'a': Target(<class 'int'>), 'b': Target(<class 'float'>)}
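
The two key styles used above (a column name or a role) can be sketched in plain Python. Everything here is hypothetical: the `Role` dataclass and the `replace_roles` function are written for this illustration and only mimic the behaviour visible in the cells above. Unlike `Dataset.replace_roles`, the sketch returns a new mapping instead of changing the existing one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Role:
    """Hypothetical stand-in for a role: a kind plus an optional data type."""
    kind: str
    data_type: Optional[type] = None


def replace_roles(roles, new_roles):
    """Return an updated copy of the role mapping.

    String keys address one column by name; Role keys address every
    column whose current role has the same kind.
    """
    updated = dict(roles)
    for key, new_role in new_roles.items():
        if isinstance(key, Role):
            for column, role in roles.items():
                if role.kind == key.kind:
                    updated[column] = new_role
        else:
            updated[key] = new_role
    return updated


roles = {'a': Role('feature', int), 'b': Role('info')}
by_name = replace_roles(roles, {'b': Role('target', float)})
by_role = replace_roles(roles, {Role('feature'): Role('target')})
print(by_name)
print(by_role)
```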

### Simple math methods

In [19]:
ds.mean()

        a    b
mean  2.0  5.0

In [20]:
ds.count()

       a  b
count  3  3

In [21]:
ds.log()

          a         b
0  0.000000  1.386294
1  0.693147  1.609438
2  1.098612  1.791759
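
The values above are consistent with the natural logarithm applied element-wise. A quick check with the standard library, independent of HypEx, rounds to the six decimals shown in the output:

```python
import math

a = [1, 2, 3]
b = [4.0, 5.0, 6.0]

# Natural logarithm of every element, rounded to six decimals.
log_a = [round(math.log(x), 6) for x in a]
log_b = [round(math.log(x), 6) for x in b]
print(log_a)
print(log_b)
```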

In [22]:
ds.min()

     a  b
min  1  4

### Get items 
Getting items and navigating through the Dataset work much as they do in pandas, except that a Dataset object is always returned.

In [23]:
ds[1]

     1
a  2.0
b  5.0

In [24]:
ds['a'][1]

   1
a  2

In [25]:
ds[ds[['a', 'b']] == 4]

    a    b
0 NaN  4.0
1 NaN  NaN
2 NaN  NaN

It is also possible to set data this way, but the approach is limited and not recommended. The main problem is that the role of the new data is not defined, as the corresponding warning indicates.

In [26]:
ds['c'] = [-3, -7, -9]



### Add column
The correct way to add new columns to a Dataset object is to use the `add_column` method, which also takes the role of the new column.

In [27]:
ds.add_column([7, 8, 9], {'c': TargetRole(int)})
ds

   a    b  c
0  1  4.0  7
1  2  5.0  8
2  3  6.0  9

### Apply

The Dataset `apply` function works similarly to `apply` in the pandas library, but it requires additional information about the roles in the Dataset.

In [28]:
ds.apply(lambda x: x ** 2 , role={'a': TargetRole(int), 'b': TargetRole(float), 'c': TargetRole(float)})

   a     b     c
0  1  16.0  49.0
1  4  25.0  64.0
2  9  36.0  81.0

### Group by

The `groupby` method operates in two modes:

- The first mode groups by the given fields and applies the aggregation function(s) passed in `func` to each inner Dataset.
- The second mode (no `func`) groups by the given fields and returns a list of `Tuple[group_key, sub_dataset]`.

In [29]:
groups_func = ds.groupby('a', func='mean')
groups_func

[(1,
          a    b    c
  mean  1.0  4.0  7.0),
 (2,
          a    b    c
  mean  2.0  5.0  8.0),
 (3,
          a    b    c
  mean  3.0  6.0  9.0)]

In [30]:
groups = ds.groupby('a')
groups

[(1,
     a    b  c
  0  1  4.0  7),
 (2,
     a    b  c
  1  2  5.0  8),
 (3,
     a    b  c
  2  3  6.0  9)]

In [31]:
groups_func_fields = ds.groupby('a', func=['mean', 'var'], fields_list='c')
groups_func_fields

[(1,
          c
  mean  7.0
  var   NaN),
 (2,
          c
  mean  8.0
  var   NaN),
 (3,
          c
  mean  9.0
  var   NaN)]
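
The two modes can be pictured with a small plain-Python sketch. The functions `group_rows`, `group_agg` and `sample_variance` are hypothetical names written for this illustration and are not part of HypEx. The sketch also suggests why `var` is NaN above: every group holds a single row, and a sample variance with an n - 1 denominator (the pandas default) is undefined for one observation. That HypEx relies on this default is an assumption based on the output.

```python
import statistics
from collections import defaultdict

rows = [
    {'a': 1, 'b': 4.0, 'c': 7},
    {'a': 2, 'b': 5.0, 'c': 8},
    {'a': 3, 'b': 6.0, 'c': 9},
]


def group_rows(rows, by):
    """Second mode: return a list of (group_key, sub_rows) tuples."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    return sorted(groups.items())


def sample_variance(values):
    """Sample variance with an n - 1 denominator; NaN when undefined."""
    if len(values) < 2:
        return float('nan')
    return statistics.variance(values)


def group_agg(rows, by, field):
    """First mode: aggregate one field inside every group."""
    result = []
    for key, sub_rows in group_rows(rows, by):
        values = [row[field] for row in sub_rows]
        stats = {'mean': statistics.mean(values), 'var': sample_variance(values)}
        result.append((key, stats))
    return result


groups = group_rows(rows, 'a')
aggregated = group_agg(rows, 'a', 'c')
print(groups)
print(aggregated)
```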

### Transpose
The transpose function resets the roles in the new Dataset, so it takes a `roles` argument that lets you set the new roles. The default is `roles=None`. In this case, every column gets a default role (shown as `Default` in the output of `ds.transpose().roles`) with an automatically identified data type.

In [32]:
ds.transpose({'one': FeatureRole(), '2': InfoRole(), 'III': InfoRole()})

   one    2  III
a  1.0  2.0  3.0
b  4.0  5.0  6.0
c  7.0  8.0  9.0

In [33]:
ds.transpose()

     0    1    2
a  1.0  2.0  3.0
b  4.0  5.0  6.0
c  7.0  8.0  9.0

In [34]:
ds.transpose().roles

{0: Default(<class 'float'>),
 1: Default(<class 'float'>),
 2: Default(<class 'float'>)}

In [35]:
ds.transpose(['one', '2', 'III'])

   one    2  III
a  1.0  2.0  3.0
b  4.0  5.0  6.0
c  7.0  8.0  9.0
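
Conceptually, transposing turns each original column into a row and each original row into a column, which is why new column names (and roles) are needed. A plain-Python sketch of the idea, not the HypEx implementation:

```python
columns = {'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0], 'c': [7, 8, 9]}
new_names = ['one', '2', 'III']

# zip(*...) walks the columns in parallel, yielding one original row at a time.
rows_of_original = list(zip(*columns.values()))

# Each original row becomes a new column; the old column names
# ('a', 'b', 'c') would become the new index.
transposed = {
    name: list(values) for name, values in zip(new_names, rows_of_original)
}
print(transposed)
```

Each new column now mixes values from the integer columns `a` and `c` with values from the float column `b`, which is presumably why the transposed output above is shown entirely as floats.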

### Shuffle
Shuffles the rows of the dataset. Sampling with `frac=1.0` returns every row, in random order.

In [47]:
ds.sample(frac=1.0)

   a    b  c
1  2  5.0  8
0  1  4.0  7
2  3  6.0  9
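
Drawing all rows without replacement is the same as producing a random permutation of the index. The standard-library sketch below illustrates this, under the assumption that `sample` draws without replacement by default, as pandas does. The fixed seed is there only to make the sketch reproducible.

```python
import random

index = [0, 1, 2]

rng = random.Random(0)  # fixed seed, used only for reproducibility
# Sampling k = len(index) items without replacement yields a permutation.
shuffled_index = rng.sample(index, k=len(index))
print(shuffled_index)
```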

### Replace
The behaviour is similar to pandas, but if the replacement changes the data type of a column, the type in the column's role must be updated first.

In [48]:
dsr = copy.deepcopy(ds)
dsr.replace(2, 15)

    a    b  c
0   1  4.0  7
1  15  5.0  8
2   3  6.0  9

In [49]:
dsr.roles['a'] = TargetRole(str)
dsr.replace(1, "a")

   a    b  c
0  a  4.0  7
1  2  5.0  8
2  3  6.0  9

### Append
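The `append` method adds the rows of another Dataset to the end of the dataset. It returns a new Dataset and leaves the original unchanged, as the second cell below shows.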

In [52]:
ds.append(ds)

   a    b  c
0  1  4.0  7
1  2  5.0  8
2  3  6.0  9
0  1  4.0  7
1  2  5.0  8
2  3  6.0  9

In [53]:
ds

   a    b  c
0  1  4.0  7
1  2  5.0  8
2  3  6.0  9

# Experiment Data

ExperimentData is the structure that contains several datasets, which together form the data for the experiment. It contains:
* `ds` - the dataset under study
* `additional_fields` - additional fields that may be added to the dataset by a merge on index; each column name is the state id of an executor
* `variables` - results of executors that are returned as single values; each key is the state id of an executor
* `analysis_tables` - dictionary of tables from executors; each key is the state id of an executor and each value is the table from that executor
* `groups` - cache of the data splits, used to optimise calculations
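
As a rough mental model only, the container described above can be pictured as a dataclass with one main table and four initially empty stores. The class name `ExperimentDataSketch`, the field types and the state id `"mean_a"` are all hypothetical and chosen for this illustration; the real `ExperimentData` holds Dataset objects, not plain dicts.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ExperimentDataSketch:
    """Hypothetical model of the fields listed above."""
    ds: Dict[str, List[Any]]                                    # main table
    additional_fields: Dict[str, List[Any]] = field(default_factory=dict)
    variables: Dict[str, Any] = field(default_factory=dict)     # state id -> value
    analysis_tables: Dict[str, Any] = field(default_factory=dict)
    groups: Dict[str, Any] = field(default_factory=dict)        # cached splits


ed_sketch = ExperimentDataSketch(ds={'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# An executor with the made-up state id "mean_a" stores a single value.
ed_sketch.variables['mean_a'] = sum(ed_sketch.ds['a']) / len(ed_sketch.ds['a'])
print(ed_sketch.variables)
```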

In [54]:
ed = ExperimentData(ds)

In [55]:
ed.ds

   a    b  c
0  1  4.0  7
1  2  5.0  8
2  3  6.0  9

In [56]:
ed.additional_fields

Empty DataFrame
Columns: []
Index: [0, 1, 2]

In [57]:
ed.variables

{}

In [58]:
ed.analysis_tables

{}

In [59]:
ed.groups

{}