# Data preprocessing

This tutorial covers how to extract selected IDyOM outputs (for analysis in Python) and export them in different
formats using py2lispIDyOM, assuming you already have the `.dat` output file. For an overview of py2lispIDyOM's functionality, see the [README](../README.md).

---
### Content

[1. Extracting IDyOM output data](#1-extracting-idyom-output-data)

## 1. Extracting IDyOM output data


Once you have the `.dat` output file, you can extract selected properties of selected melodies from it.

We continue with the example from the running-IDyOM tutorial and extract some IDyOM outputs from that experiment, whose log folder is `experiment_history/21-05-22_17.05.05/`.


In [3]:
# import ExperimentInfo
from py2lispIDyOM.extract import ExperimentInfo

### Indicate the experiment log folder that you want to work with:

To start, indicate the experiment log folder that you want to work with by passing its path as
`experiment_folder_path`.

In [4]:
# Set experiment_folder_path:
my_experiment = ExperimentInfo(experiment_folder_path='experiment_history/21-05-22_17.05.05/')
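
Because everything that follows depends on the log folder path, it can help to check that the folder exists before handing it to `ExperimentInfo`. The sketch below uses only the standard library; `resolve_experiment_folder` is a hypothetical helper, not part of py2lispIDyOM, and it keeps the trailing slash only because the example path in this tutorial has one.

```python
import tempfile
from pathlib import Path


def resolve_experiment_folder(folder):
    """Return `folder` as a string ending in '/', or raise if it is not a directory.

    Hypothetical helper for illustration; not part of py2lispIDyOM.
    """
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(f"Experiment log folder not found: {path}")
    # Keep the trailing slash used by the example path in this tutorial.
    return path.as_posix() + "/"


# Demonstration with a throwaway directory standing in for a real log folder.
with tempfile.TemporaryDirectory() as tmp:
    log_folder = Path(tmp) / "experiment_history" / "21-05-22_17.05.05"
    log_folder.mkdir(parents=True)
    checked_path = resolve_experiment_folder(log_folder)
    print(checked_path.endswith("21-05-22_17.05.05/"))
    # With py2lispIDyOM installed, the checked path could then be passed on:
    # my_experiment = ExperimentInfo(experiment_folder_path=checked_path)
```

Failing early with a clear message is usually easier to debug than an error raised later while the experiment data is being read.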


### Experiment Info: to access some melodies in that experiment:

There are two ways to access melodies in the experiment:
1. Access a specific melody using `melodies_dict` by passing the melody name.
  - This returns a DataFrame of all IDyOM outputs for this melody.

2. Access specific melodies using the method `access_melodies(starting_index=None, ending_index=None, melody_names=None)`.
  - This returns a list of DataFrames containing all IDyOM outputs for the selected melodies.
    

In [13]:
# Access the melody named '"chor-012"' using `melodies_dict`:

selected_melody = my_experiment.melodies_dict['"chor-012"']

print(selected_melody)

      dataset.id  melody.id  note.id melody.name vertint12  articulation  \
0   6.605212e+13       12.0      1.0  "chor-012"        NA           0.0   
1   6.605212e+13       12.0      2.0  "chor-012"        NA           0.0   
2   6.605212e+13       12.0      3.0  "chor-012"        NA           0.0   
3   6.605212e+13       12.0      4.0  "chor-012"        NA           0.0   
4   6.605212e+13       12.0      5.0  "chor-012"        NA           0.0   
5   6.605212e+13       12.0      6.0  "chor-012"        NA           0.0   
6   6.605212e+13       12.0      7.0  "chor-012"        NA           0.0   
7   6.605212e+13       12.0      8.0  "chor-012"        NA           0.0   
8   6.605212e+13       12.0      9.0  "chor-012"        NA           0.0   
9   6.605212e+13       12.0     10.0  "chor-012"        NA           0.0   
10  6.605212e+13       12.0     11.0  "chor-012"        NA           0.0   
11  6.605212e+13       12.0     12.0  "chor-012"        NA           0.0   
12  6.605212

In [20]:
# Access the melodies named '"chor-010"' and '"chor-011"' using the `access_melodies` method.

selected_melodies = my_experiment.access_melodies(melody_names = ['"chor-010"','"chor-011"'])

# print(selected_melodies)
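
In this experiment the melody names carry literal double quotes, as in `'"chor-012"'`, so a bare name such as `'chor-012'` would presumably not match. The sketch below shows one way to normalise names before a lookup. `as_melody_key` is a hypothetical helper, and the dictionary is a plain stand-in for `my_experiment.melodies_dict` with placeholder values.

```python
def as_melody_key(name):
    """Wrap a bare melody name in double quotes unless it already has them."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name
    return f'"{name}"'


# Stand-in for `my_experiment.melodies_dict`; the values are placeholders.
melodies_dict = {
    '"chor-010"': "data for chor-010",
    '"chor-011"': "data for chor-011",
}

# A bare name and an already-quoted name resolve to the same kind of key.
print(melodies_dict[as_melody_key("chor-010")])
print(as_melody_key('"chor-011"'))
```

The same helper could be mapped over a list of names before passing them as `melody_names`.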


In [21]:
# Access the first 3 melodies using the `access_melodies` method.

selected_melodies = my_experiment.access_melodies(ending_index=3)

# print(selected_melodies)
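
Since `access_melodies` returns a list of DataFrames, it is sometimes convenient to stack them into a single table. The sketch below does this with `pandas.concat`, relying on the `melody.name` column to tell the melodies apart. The frames are small stand-ins with made-up values, and it is an assumption that the objects returned by py2lispIDyOM can be passed to `pd.concat` like ordinary DataFrames.

```python
import pandas as pd

# Stand-ins for the list returned by `access_melodies`; the values are made up.
frames = [
    pd.DataFrame({"melody.name": ['"chor-010"'] * 2, "note.id": [1.0, 2.0]}),
    pd.DataFrame({"melody.name": ['"chor-011"'] * 3, "note.id": [1.0, 2.0, 3.0]}),
]

# Stack all melodies into one table with a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)

# Count the notes (rows) belonging to each melody.
notes_per_melody = combined.groupby("melody.name").size()
print(notes_per_melody)
```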


### Melody Info: to further access melody-specific information

For each melody in the experiment, all data are stored in the `MelodyInfo` class, which is essentially a pandas DataFrame. To create an instance of `MelodyInfo`, we can use `ExperimentInfo.melodies_dict` or `ExperimentInfo.access_melodies`, as shown above.

In [24]:
# first, access the melody:
selected_melody = my_experiment.melodies_dict['"chor-002"']


Note that here, `selected_melody` is an instance of `MelodyInfo`.

#### Check the valid IDyOM output keywords for selected melody

In [25]:
# check the valid IDyOM output keywords for this selected_melody:

valid_idyom_output_key_list = selected_melody.get_idyom_output_keyword_list()

print(valid_idyom_output_key_list)

['dataset.id', 'melody.id', 'note.id', 'melody.name', 'vertint12', 'articulation', 'comma', 'voice', 'ornament', 'dyn', 'phrase', 'bioi', 'deltast', 'accidental', 'mpitch', 'cpitch', 'barlength', 'pulses', 'tempo', 'mode', 'keysig', 'dur', 'onset', 'cpitch.order.ltm.cpitch', 'cpitch.order.stm.cpitch', 'cpitch.weight.ltm', 'cpitch.weight.stm', 'cpitch.weight.ltm.cpitch', 'cpitch.weight.stm.cpitch', 'cpitch.probability', 'cpitch.information.content', 'cpitch.entropy', 'cpitch.55', 'cpitch.57', 'cpitch.58', 'cpitch.59', 'cpitch.60', 'cpitch.62', 'cpitch.63', 'cpitch.64', 'cpitch.65', 'cpitch.66', 'cpitch.67', 'cpitch.68', 'cpitch.69', 'cpitch.70', 'cpitch.71', 'cpitch.72', 'cpitch.73', 'cpitch.74', 'cpitch.75', 'cpitch.76', 'cpitch.77', 'cpitch.78', 'cpitch.79', 'cpitch.81', 'cpitch.82', 'cpitch.83', 'cpitch.84', 'cpitch.85', 'cpitch.86', 'cpitch.88', 'onset.order.ltm.onset', 'onset.order.stm.onset', 'onset.weight.ltm', 'onset.weight.stm', 'onset.weight.ltm.onset', 'onset.weight.stm.onset

The list above shows all the valid IDyOM output keys available for the melody '"chor-002"'.
Now, we want to access the following data: `cpitch.information.content`, `onset.information.content`, and `entropy`.
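
Before requesting keywords, it can be worth checking them against the valid keyword list, since this tutorial does not say how an unknown keyword is handled. The sketch below is plain Python. `split_keywords` is a hypothetical helper, and `valid_excerpt` is only a short excerpt of the list printed above.

```python
def split_keywords(requested, valid):
    """Split `requested` into (available, missing) relative to `valid`, keeping order."""
    valid_set = set(valid)
    available = [key for key in requested if key in valid_set]
    missing = [key for key in requested if key not in valid_set]
    return available, missing


# A short excerpt of the keyword list printed above, not the full list.
valid_excerpt = [
    "cpitch.probability",
    "cpitch.information.content",
    "cpitch.entropy",
]

available, missing = split_keywords(
    ["cpitch.information.content", "not.a.real.keyword"], valid_excerpt
)
print(available)
print(missing)
```

In practice, `valid` would be the list returned by `get_idyom_output_keyword_list()`.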

#### Access IDyOM output data via keywords

We will use the `MelodyInfo` method called `access_idyom_output_keywords`.

Pass a list of keywords to the method, and it returns a DataFrame.

In [26]:
# Accessing `cpitch.information.content`, `onset.information.content`, `entropy` for '"chor-002"'

selected_idyom_outputs = selected_melody.access_idyom_output_keywords(['cpitch.information.content',
                                                                      'onset.information.content',
                                                                      'entropy'])

print(selected_idyom_outputs)

    cpitch.information.content  onset.information.content   entropy
0                     5.065964                   2.880014  8.166577
1                     5.472063                   3.583843  7.022451
2                     4.075408                   0.902314  6.759365
3                     6.103196                   1.487593  5.508813
4                     2.547856                   2.377832  5.474818
5                     5.943005                   3.686743  6.267564
6                     3.631144                   3.251021  5.693554
7                     7.146198                   1.028696  4.829847
8                     3.317424                   0.899349  6.363213
9                     2.924594                   1.044276  5.423235
10                    3.381295                   5.122056  5.354833
11                    6.861183                   2.590604  5.034352
12                    5.999832                   3.083826  5.695037
13                    2.485237                  
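
Because the result is a DataFrame, ordinary pandas operations should apply to it. The sketch below rebuilds the first three rows of the output printed above as a plain DataFrame stand-in, then selects one column and computes per-column means. It assumes the object returned by `access_idyom_output_keywords` supports the same operations.

```python
import pandas as pd

# First three rows of the output printed above, as a plain DataFrame stand-in.
selected_idyom_outputs = pd.DataFrame(
    {
        "cpitch.information.content": [5.065964, 5.472063, 4.075408],
        "onset.information.content": [2.880014, 3.583843, 0.902314],
        "entropy": [8.166577, 7.022451, 6.759365],
    }
)

# Select a single column; the result is a pandas Series.
pitch_ic = selected_idyom_outputs["cpitch.information.content"]

# Mean of each column, computed over these three rows only.
column_means = selected_idyom_outputs.mean()

print(pitch_ic.tolist())
print(column_means.round(3))
```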