# Dataset and ExperimentData tutorial

In this tutorial, we look at the Dataset and ExperimentData classes. These are the key classes for working with data in HypEx.

In [1]:
import copy
import pandas as pd

from hypex.dataset.dataset import Dataset, ExperimentData
from hypex.dataset.roles import TargetRole, InfoRole, FeatureRole

## Create Dataset
Initializes a new instance of the Dataset class.

Args:
* __roles__: A dictionary mapping roles to their corresponding column names and types. Roles are used to mark up data by their intended purpose. There are different types of roles that have different meanings in different contexts.
* __data__: The data to be used for the dataset. Can be either a pandas DataFrame or a file path. Defaults to None.
* __backend__: The backend to be used for the dataset. Defaults to None, in which case the `pandas` backend is used.
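
To make the idea of role markup more concrete, here is a minimal pure-Python sketch. It is not HypEx code: `ToyRole` and `apply_roles` are hypothetical names, and the assumption that a role's declared type is used to cast its column is inferred only from the example outputs in this tutorial.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyRole:
    """Hypothetical stand-in for a role: a purpose plus an optional data type."""
    name: str
    data_type: Optional[type] = None


def apply_roles(columns, roles):
    """Cast each column to the type declared by its role, if one is declared."""
    typed = {}
    for column, values in columns.items():
        role = roles[column]
        if role.data_type is None:
            typed[column] = list(values)  # no declared type: keep values as they are
        else:
            typed[column] = [role.data_type(v) for v in values]
    return typed


columns = {"a": [1, 2, 3], "b": [4, 5, 6]}
roles = {"a": ToyRole("target"), "b": ToyRole("target", float)}
typed = apply_roles(columns, roles)
print(typed)  # column "b" is cast to float, column "a" is left unchanged
```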


In [2]:
ds = Dataset({'a': TargetRole(), 'b': TargetRole(float)})
ds

Empty DataFrame
Columns: []
Index: []

In [3]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

ds = Dataset({'a': TargetRole(), 'b': TargetRole(float)}, data=df)
ds

   a    b
0  1  4.0
1  2  5.0
2  3  6.0

In [4]:
ds.roles

{'a': Target(<class 'int'>), 'b': Target(<class 'float'>)}

## Create empty
Creates an empty Dataset. It takes the same arguments as the Dataset constructor, except for the data. Additionally, you can pass an index to create an empty Dataset with a defined index and size.

In [5]:
ds_empty = Dataset.create_empty()
ds_empty

Empty DataFrame
Columns: []
Index: []

In [6]:
ds_empty = Dataset.create_empty(roles={'a': TargetRole(), 'b': TargetRole(float)}, index=range(7))
ds_empty

     a    b
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
6  NaN  NaN

## Backend
A backend in HypEx is a class that adapts data storage, navigation, transformation, and calculation from the original framework for use by the Dataset. You can access it via the `Dataset.backend` property.

In [10]:
type(ds.backend)

hypex.dataset.backends.pandas_backend.PandasDataset

In [12]:
ds.backend

   a    b
0  1  4.0
1  2  5.0
2  3  6.0

To access the underlying data of the backend object, use the `Dataset.data` property.

In [11]:
type(ds.data)

pandas.core.frame.DataFrame

In [13]:
ds.data

   a    b
0  1  4.0
1  2  5.0
2  3  6.0


## Dataset Methods

In the current version of HypEx, functions are modeled on Pandas, so most of them work the same way. Here we focus on the features that differ significantly from Pandas.

### From dict
This static method creates a Dataset object from a dict. It accepts data in two forms: a dict of columns or a list of row dicts.


**First way**

In [16]:
ds_from_dict = Dataset.from_dict({'a': [1, 2], 'b': [3, 4]}, {'a': TargetRole(), 'b': InfoRole()})
ds_from_dict

   a  b
0  1  3
1  2  4

**Second way**

In [17]:
ds_from_dict = Dataset.from_dict([{'a': 1, 'b': 3}, {'a': 2, 'b': 4}], {'a': TargetRole(), 'b': InfoRole()})
ds_from_dict

   a  b
0  1  3
1  2  4

### Search Columns
This method searches for columns in a Dataset object by role and type.
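
The filtering logic can be pictured with a small pure-Python sketch. It is only an illustration under assumptions: `toy_search_columns` is a hypothetical helper, and roles are modeled as plain `(role_name, data_type)` tuples rather than HypEx role objects.

```python
def toy_search_columns(roles, role_name, search_types=None):
    """Return the columns whose role matches and whose type is in search_types."""
    found = []
    for column, (name, data_type) in roles.items():
        if name != role_name:
            continue  # different role: skip
        if search_types is not None and data_type not in search_types:
            continue  # role matches, but the type was not requested
        found.append(column)
    return found


toy_roles = {"a": ("target", int), "b": ("target", float), "c": ("info", str)}
only_int_targets = toy_search_columns(toy_roles, "target", search_types=[int])
all_targets = toy_search_columns(toy_roles, "target")
print(only_int_targets)  # ['a']
print(all_targets)       # ['a', 'b']
```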

In [19]:
columns_found = ds.search_columns(TargetRole(), search_types=[int])
columns_found

['a']

In [20]:
ds[columns_found]

   a
0  1
1  2
2  3

### Simple math methods

In [21]:
ds.mean()

        a    b
mean  2.0  5.0

In [22]:
ds.count()

       a  b
count  3  3

In [23]:
ds.log()

          a         b
0  0.000000  1.386294
1  0.693147  1.609438
2  1.098612  1.791759

In [24]:
ds.min()

     a    b
min  1  4.0

### Get items 
Getting items and navigation work much as in Pandas, with the difference that a Dataset object is always returned.

In [25]:
ds[1]

     1
a  2.0
b  5.0

In [26]:
ds['a'][1]

   1
a  2

In [27]:
ds[ds[['a', 'b']] == 4]

    a    b
0 NaN  4.0
1 NaN  NaN
2 NaN  NaN

It is also possible to set data this way, but the approach is limited and not recommended. The main problem is that the markup (role) of the new data is not defined, as the corresponding warning indicates.

In [35]:
ds['c'] = [-3, -7, -9]



### Add column
This is the right way to add a column to the dataset.

In [36]:
ds.add_column([7, 8, 9], {'c': TargetRole(int)})
ds

   a    b  c
0  3  4.0  7
1  7  5.0  8
2  9  6.0  9

### Apply

The Dataset apply function works similarly to the apply function in the pandas library, but it requires additional information about the roles in the Dataset created in this way.

In [46]:
ds.apply(lambda x: x ** 2 , role={'a': TargetRole(int), 'b': TargetRole(float), 'c': TargetRole(float)})

    a     b     c
0   9  16.0  49.0
1  49  25.0  64.0
2  81  36.0  81.0

### Group by

The groupby method works in two modes:

- In the first mode, it groups by the given fields and returns the result of the aggregation function for each inner Dataset.
- In the second mode, it groups by the given fields and returns a list of `Tuple[group_key, sub_dataset]`.
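
The two modes can be sketched in plain Python as follows. This is an illustration only, not the HypEx implementation: `toy_groupby` is a hypothetical helper that works on a list of row dicts and takes a callable as `func`.

```python
from statistics import mean


def toy_groupby(rows, by, func=None):
    """Group row dicts by one field; optionally aggregate each group with func."""
    groups = {}
    for row in rows:
        groups.setdefault(row[by], []).append(row)

    result = []
    for key, sub_rows in groups.items():
        if func is None:
            # Second mode: return the group itself
            result.append((key, sub_rows))
        else:
            # First mode: apply the aggregation to every column of the group
            aggregated = {col: func([r[col] for r in sub_rows]) for col in sub_rows[0]}
            result.append((key, aggregated))
    return result


toy_rows = [{"a": 3, "b": 4.0}, {"a": 7, "b": 5.0}, {"a": 3, "b": 6.0}]
plain_groups = toy_groupby(toy_rows, "a")
mean_groups = toy_groupby(toy_rows, "a", func=mean)
print(plain_groups)
print(mean_groups)
```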

In [49]:
groups_func = ds.groupby('a', func='mean')
groups_func

[(3,
          a    b    c    e
  mean  3.0  4.0  7.0  1.0),
 (7,
          a    b    c    e
  mean  7.0  5.0  8.0  2.0),
 (9,
          a    b    c    e
  mean  9.0  6.0  9.0  3.0)]

In [50]:
groups = ds.groupby('a')
groups

[(3,
     a    b  c  e
  0  3  4.0  7  1),
 (7,
     a    b  c  e
  1  7  5.0  8  2),
 (9,
     a    b  c  e
  2  9  6.0  9  3)]

In [51]:
groups_func_fields = ds.groupby('a', func=['mean', 'var'], fields_list='e')
groups_func_fields

[(3,
          e
  mean  1.0
  var   NaN),
 (7,
          e
  mean  2.0
  var   NaN),
 (9,
          e
  mean  3.0
  var   NaN)]

### Transpose
The transpose function is specific in that it resets the roles in the new Dataset, so it takes a `roles` argument. The default is `roles=None`, in which case all roles are FeatureRole.

In [52]:
ds.transpose({'one': FeatureRole(), '2': InfoRole(), 'III': InfoRole()})

   one    2  III
a  3.0  7.0  9.0
b  4.0  5.0  6.0
c  7.0  8.0  9.0
e  1.0  2.0  3.0

In [53]:
ds.transpose()

     0    1    2
a  3.0  7.0  9.0
b  4.0  5.0  6.0
c  7.0  8.0  9.0
e  1.0  2.0  3.0

In [54]:
ds.transpose().roles

{0: Feature(<class 'float'>),
 1: Feature(<class 'float'>),
 2: Feature(<class 'float'>)}

In [55]:
ds.transpose(['one', '2', 'III'])

   one    2  III
a  3.0  7.0  9.0
b  4.0  5.0  6.0
c  7.0  8.0  9.0
e  1.0  2.0  3.0

### Shuffle
Shuffles the rows of the dataset.
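
As a rough illustration of what a seeded shuffle means, the sketch below permutes rows while keeping their original index labels. It is a standard-library analogy, not HypEx code: `toy_shuffle` is a hypothetical helper, and its permutation for a given seed will not match the one HypEx produces.

```python
import random


def toy_shuffle(rows, random_state=None):
    """Return (original_index, row) pairs in a random order."""
    rng = random.Random(random_state)  # a fixed seed gives a repeatable order
    order = list(range(len(rows)))
    rng.shuffle(order)
    return [(i, rows[i]) for i in order]


toy_rows = [{"a": 3}, {"a": 7}, {"a": 9}]
first = toy_shuffle(toy_rows, random_state=42)
second = toy_shuffle(toy_rows, random_state=42)
print(first == second)  # True: the same seed reproduces the same order
```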

In [56]:
ds.shuffle()

   a    b  c  e
1  7  5.0  8  2
0  3  4.0  7  1
2  9  6.0  9  3

In [57]:
ds.shuffle()

   a    b  c  e
0  3  4.0  7  1
2  9  6.0  9  3
1  7  5.0  8  2

In [58]:
ds.shuffle(random_state=42)

   a    b  c  e
0  3  4.0  7  1
1  7  5.0  8  2
2  9  6.0  9  3

### Replace
Works as in Pandas, but pay attention to the types.

In [60]:
dsr = copy.deepcopy(ds)
dsr.replace(2, 15)

   a    b  c   e
0  3  4.0  7   1
1  7  5.0  8  15
2  9  6.0  9   3

In [64]:
# Calling dsr.replace(1, "a") before changing the role raises ValueError
dsr.roles['e'] = TargetRole(str)
dsr.replace(1, "a")

   a    b  c  e
0  3  4.0  7  a
1  7  5.0  8  2
2  9  6.0  9  3
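
One plausible way to picture this type check is the sketch below. It is an assumption for illustration, not the check HypEx actually performs: `toy_replace` is a hypothetical helper that rejects a replacement value whose type does not match the declared type of the column.

```python
def toy_replace(values, declared_type, to_replace, value):
    """Replace matching entries, refusing values of an undeclared type."""
    if not isinstance(value, declared_type):
        raise ValueError(
            f"replacement {value!r} is not of declared type {declared_type.__name__}"
        )
    return [value if v == to_replace else v for v in values]


try:
    toy_replace([1, 2, 3], int, 1, "a")  # str value in an int column
except ValueError as error:
    print("rejected:", error)

# Once the declared type is str, the same kind of replacement is accepted
replaced = toy_replace(["1", "2", "3"], str, "1", "a")
print(replaced)  # ['a', '2', '3']
```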

### Append
The append method adds the rows of another Dataset (or a list of Datasets) to the end of the dataset.
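
The difference between keeping the original index labels and renumbering the rows can be sketched in plain Python. This is only an analogy: `toy_append` and its `reset_index` flag are hypothetical and stand in for the behavior that the examples in this section show with `index=True`.

```python
def toy_append(base, others, reset_index=False):
    """Concatenate lists of (index, row) pairs, optionally renumbering from 0."""
    combined = list(base)
    for other in others:
        combined.extend(other)
    if reset_index:
        combined = [(i, row) for i, (_, row) in enumerate(combined)]
    return combined


toy_ds = [(0, {"a": 3}), (1, {"a": 7})]
kept = toy_append(toy_ds, [toy_ds])
renumbered = toy_append(toy_ds, [toy_ds], reset_index=True)
print([i for i, _ in kept])        # [0, 1, 0, 1]
print([i for i, _ in renumbered])  # [0, 1, 2, 3]
```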

In [65]:
ds.append(ds)

   a    b  c  e
0  3  4.0  7  1
1  7  5.0  8  2
2  9  6.0  9  3
0  3  4.0  7  1
1  7  5.0  8  2
2  9  6.0  9  3

In [66]:
ds.shuffle().append(other=ds, index=True)

   a    b  c  e
0  7  5.0  8  2
1  9  6.0  9  3
2  3  4.0  7  1
3  3  4.0  7  1
4  7  5.0  8  2
5  9  6.0  9  3

In [67]:
ds.shuffle().append(other=[ds]*2, index=True)

   a    b  c  e
0  9  6.0  9  3
1  3  4.0  7  1
2  7  5.0  8  2
3  3  4.0  7  1
4  7  5.0  8  2
5  9  6.0  9  3
6  3  4.0  7  1
7  7  5.0  8  2
8  9  6.0  9  3

# Experiment Data

ExperimentData is a structure that contains:
* `ds` - the dataset under study
* `additional_fields` - additional fields that executors may add to the dataset by merging on the index; each column is named by the executor's state id
* `variables` - results of executors that are returned as single values; each key is the executor's state id
* `analysis_tables` - a dictionary of tables from executors; each key is the executor's state id and each value is the table from that executor
* `groups` - a cache of split data used to speed up calculations
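
The shape of this container can be sketched as a plain dataclass. This is a simplified picture, not the HypEx class: `ToyExperimentData` and the `"executor_1"` state id are hypothetical, and plain dicts and lists stand in for Dataset objects.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyExperimentData:
    """Hypothetical container mirroring the five parts listed above."""
    ds: List[Dict[str, Any]]                                          # dataset under study
    additional_fields: Dict[str, List[Any]] = field(default_factory=dict)
    variables: Dict[str, Any] = field(default_factory=dict)          # single-value results
    analysis_tables: Dict[str, Any] = field(default_factory=dict)    # tables per executor
    groups: Dict[str, Any] = field(default_factory=dict)             # cache of split data


toy_ed = ToyExperimentData(ds=[{"a": 3}, {"a": 7}, {"a": 9}])
# An executor would store its result under its own state id (hypothetical id here)
toy_ed.variables["executor_1"] = 0.05
print(toy_ed.variables)
print(toy_ed.analysis_tables)  # still empty until an executor adds a table
```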

In [71]:
ed = ExperimentData(ds)

In [72]:
ed.ds

   a    b  c  e
0  3  4.0  7  1
1  7  5.0  8  2
2  9  6.0  9  3

In [73]:
ed.additional_fields

Empty DataFrame
Columns: []
Index: [0, 1, 2]

In [74]:
ed.variables

{}

In [75]:
ed.analysis_tables

{}

In [76]:
ed.groups

{}