# The Run_SE Module

The `run_SE` module of `repytah-se` constructs and analyzes Start End (SE) diagrams. SE diagrams are built from Aligned Hierarchies (AH) and are based on Topological Data Analysis. The module can process either a single file or a directory that holds multiple files.

SE diagrams are better suited than Aligned Hierarchies to comparing songs for the cover song detection task, whereas the strength of Aligned Hierarchies lies in visualization. However, SE diagrams require the two songs to be exactly the same length. Because of this, it is recommended to use SNL diagrams for this task.



## Pipeline

The overall pipeline for `repytah-se` is shown below.

<img src="pictures/SNLflow.jpg" width="380">



To create a distance matrix from SE/SNL diagrams, it is recommended to provide multiple files.

You can start from chroma vector files, or from aligned hierarchies if you have already computed them. You can also choose between processing a single file into SE/SNL diagrams and processing a directory of multiple files into SE/SNL diagrams and distance matrices. The examples below cover each of these options.

## Import Modules


In [1]:
import numpy as np
import pandas as pd
import persim
import pkg_resources

from run_SE import *


## Processing a Single File

This phase integrates `repytah` with `repytah-se`. We create aligned hierarchies from the chroma vectors of a music-based data stream (e.g., a song) stored in `.csv` files. In this step, we choose a threshold value to filter out noise and the number of feature vectors per shingle for contextualization. If you run multiple tests with the same threshold and the same number of feature vectors per shingle, it is recommended to save the intermediate aligned hierarchies. Note that while `repytah` contains methods to save aligned hierarchies as `.csv` or `.mat` files, this module saves the intermediate aligned hierarchies in `.mat` format.
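To make the two parameters more concrete, here is a minimal sketch of what shingling and thresholding could look like on toy data. It is not the `repytah` implementation: the helper names `make_shingles`, `cosine_distance`, and `threshold_pairs` are hypothetical, and the use of cosine distance is an assumption made for illustration.

```python
import math

def make_shingles(feature_vectors, num_fv_per_shingle):
    """Concatenate each run of consecutive feature vectors into one shingle."""
    n = len(feature_vectors)
    shingles = []
    for start in range(n - num_fv_per_shingle + 1):
        window = feature_vectors[start:start + num_fv_per_shingle]
        shingles.append([value for vector in window for value in vector])
    return shingles

def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def threshold_pairs(shingles, thresh):
    """Return index pairs (i, j), i < j, whose distance is at most thresh."""
    pairs = []
    for i in range(len(shingles)):
        for j in range(i + 1, len(shingles)):
            if cosine_distance(shingles[i], shingles[j]) <= thresh:
                pairs.append((i, j))
    return pairs

# Toy "chroma" sequence that alternates between two 2-D feature vectors.
toy_vectors = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]
toy_shingles = make_shingles(toy_vectors, num_fv_per_shingle=2)
toy_pairs = threshold_pairs(toy_shingles, thresh=0.02)
print(toy_pairs)
```

In this sketch, a larger number of feature vectors per shingle gives each time step more context, and a smaller threshold keeps only near-identical shingles as candidate repeats.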

For more on how aligned hierarchies are created, see `example_vignette.ipynb`.

We are using Chopin's Mazurka Op.6, No.1 as input for this demonstration.

In [2]:
# starting from chroma vectors

filepath = 'data/input.csv'
num_fv_per_shingle = 12
thresh = 0.02

# default for isChroma is True
SE_diagram = get_SEs(filepath, num_fv_per_shingle, thresh, isChroma=True, save=False)

print(SE_diagram)


[[  1   2]
 [ 49  50]
 [  2   3]
 [ 26  27]
 [ 50  51]
 [ 74  75]
 [146 147]
 [218 219]
 [290 291]
 [314 315]
 [  3   4]
 [ 51  52]
 [123 124]
 [195 196]
 [291 292]
 [ 39  40]
 [231 232]
 [  1   3]
 [ 49  51]
 [  2   4]
 [ 50  52]
 [290 292]
 [ 37  39]
 [ 85  87]
 [157 159]
 [229 231]
 [ 37  40]
 [229 232]
 [  1   4]
 [ 49  52]
 [247 251]
 [271 275]
 [ 27  37]
 [ 75  85]
 [147 157]
 [219 229]
 [315 325]
 [ 26  37]
 [ 74  85]
 [146 157]
 [218 229]
 [314 325]
 [ 27  39]
 [ 75  87]
 [147 159]
 [219 231]
 [ 27  40]
 [219 232]
 [ 26  39]
 [ 74  87]
 [146 159]
 [218 231]
 [ 26  40]
 [218 232]
 [  4  26]
 [ 52  74]
 [124 146]
 [196 218]
 [292 314]
 [  3  26]
 [ 51  74]
 [123 146]
 [195 218]
 [291 314]
 [  4  27]
 [ 52  75]
 [124 147]
 [196 219]
 [292 315]
 [  3  27]
 [ 51  75]
 [123 147]
 [195 219]
 [291 315]
 [  2  26]
 [ 50  74]
 [290 314]
 [  2  27]
 [ 50  75]
 [290 315]
 [  1  26]
 [ 49  74]
 [  1  27]
 [ 49  75]
 [  4  37]
 [ 52  85]
 [124 157]
 [196 229]
 [292 325]
 [  3  37]
 [ 51  85]

In [3]:
# starting from AH

filepath ='data/input.mat'

# these values don't matter since starting from AHs
num_fv_per_shingle = 12
thresh = 0.02

SE_diagram = get_SEs(filepath, num_fv_per_shingle, thresh, isChroma=False, save=False)
print(SE_diagram)




[[  1   2]
 [ 49  50]
 [  2   3]
 [ 26  27]
 [ 50  51]
 [ 74  75]
 [146 147]
 [218 219]
 [290 291]
 [314 315]
 [  3   4]
 [ 51  52]
 [123 124]
 [195 196]
 [291 292]
 [ 39  40]
 [231 232]
 [  1   3]
 [ 49  51]
 [  2   4]
 [ 50  52]
 [290 292]
 [ 37  39]
 [ 85  87]
 [157 159]
 [229 231]
 [ 37  40]
 [229 232]
 [  1   4]
 [ 49  52]
 [247 251]
 [271 275]
 [ 27  37]
 [ 75  85]
 [147 157]
 [219 229]
 [315 325]
 [ 26  37]
 [ 74  85]
 [146 157]
 [218 229]
 [314 325]
 [ 27  39]
 [ 75  87]
 [147 159]
 [219 231]
 [ 27  40]
 [219 232]
 [ 26  39]
 [ 74  87]
 [146 159]
 [218 231]
 [ 26  40]
 [218 232]
 [  4  26]
 [ 52  74]
 [124 146]
 [196 218]
 [292 314]
 [  3  26]
 [ 51  74]
 [123 146]
 [195 218]
 [291 314]
 [  4  27]
 [ 52  75]
 [124 147]
 [196 219]
 [292 315]
 [  3  27]
 [ 51  75]
 [123 147]
 [195 219]
 [291 315]
 [  2  26]
 [ 50  74]
 [290 314]
 [  2  27]
 [ 50  75]
 [290 315]
 [  1  26]
 [ 49  74]
 [  1  27]
 [ 49  75]
 [  4  37]
 [ 52  85]
 [124 157]
 [196 229]
 [292 325]
 [  3  37]
 [ 51  85]
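
As the name suggests, each row of the printed array can be read as a `[start, end]` pair. The sketch below groups a few rows copied from the output by their span (`end - start`). The helper `spans_by_length` is hypothetical, and treating the span as the length of a repeat in time steps is an assumption, not something stated by the module.

```python
# A few rows copied from the output above; each is assumed to be [start, end].
se_rows = [[1, 2], [49, 50], [1, 3], [49, 51], [247, 251], [27, 37]]

def spans_by_length(rows):
    """Group start positions by span (end - start)."""
    groups = {}
    for start, end in rows:
        groups.setdefault(end - start, []).append(start)
    return groups

groups = spans_by_length(se_rows)
for span in sorted(groups):
    print(span, groups[span])
```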

## Processing Multiple Files

This module is based on Chopin's Mazurka Dataset, which has expanded and nonexpanded forms for each piece, so the data directories are set up as shown below. For the module to run smoothly out of the box, it is recommended that you organize your files in the same way, adapted to your purposes.

Before running, the directory should look like this:
<img src="pictures/begin_directoryascii.png" width="300">

If saving intermediate Aligned Hierarchies, the directory will look like this:
<img src="pictures/after_directoryascii.png" width="380">
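
If you want to check your own layout before running the module, the following sketch lists the files found under each class subdirectory. It assumes a layout of `dir_path/<class>/<piece>.csv`, inferred from the class names and labels used in this vignette. The helper `list_class_files` is hypothetical and is not part of `run_SE`.

```python
import tempfile
from pathlib import Path

def list_class_files(dir_path, suffix=".csv"):
    """Map each class subdirectory name to the sorted stems of its files."""
    root = Path(dir_path)
    layout = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        layout[class_dir.name] = sorted(
            f.stem for f in class_dir.glob("*" + suffix)
        )
    return layout

# Build a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    for class_name in ("expanded", "nonexpanded"):
        class_dir = Path(tmp) / class_name
        class_dir.mkdir()
        for piece in ("mazurka06-1", "mazurka06-2"):
            (class_dir / (piece + ".csv")).touch()
    layout = list_class_files(tmp)

print(layout)
```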

In [4]:
dir_path ='data/chroma_vectors'

num_fv_per_shingle = 12
thresh = 0.02

SE_all = get_SE_directory(dir_path, num_fv_per_shingle, thresh, isChroma=True, save=False)
# count the diagrams class by class so that classes of different sizes are handled
print("There are {} total SE diagrams in {}".format(sum(len(class_SEs) for class_SEs in SE_all.SEs), dir_path))
print("The classes are:", SE_all.className)
print("The labels for each SE diagram are:", SE_all.labels)


There are 4 total SE diagrams in data/chroma_vectors
The classes are: ['expanded', 'nonexpanded']
The labels for each SE diagram are: [['mazurka06-1', 'mazurka06-2'], ['mazurka06-1', 'mazurka06-2']]


In [5]:
print(len(SE_all.SEs))
print(len(SE_all.SEs[0]))
print(len(SE_all.SEs[0][0]))


2
2
157
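
The three lengths above suggest a nested layout of the form `SEs[class][song][point]`. The sketch below walks such a structure and pairs every diagram with its class and label. It uses a small stand-in object with made-up diagram values rather than the real `SE_all`. Only the attribute names `SEs`, `className`, and `labels` come from the cells above, and the helper `flatten_diagrams` is hypothetical.

```python
from types import SimpleNamespace

# Stand-in for the object returned by get_SE_directory (toy values only).
SE_demo = SimpleNamespace(
    SEs=[
        [[[1, 2]], [[1, 2], [2, 3]]],   # class 0: two songs
        [[[1, 3]], [[2, 4]]],           # class 1: two songs
    ],
    className=["expanded", "nonexpanded"],
    labels=[["mazurka06-1", "mazurka06-2"], ["mazurka06-1", "mazurka06-2"]],
)

def flatten_diagrams(se_all):
    """Return (class name, label, number of points) for every diagram."""
    records = []
    for class_name, class_labels, class_diagrams in zip(
        se_all.className, se_all.labels, se_all.SEs
    ):
        for label, diagram in zip(class_labels, class_diagrams):
            records.append((class_name, label, len(diagram)))
    return records

records = flatten_diagrams(SE_demo)
for record in records:
    print(record)
```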


## get_dist_mat

This function compares songs pairwise. However, it is recommended to do this with SNL diagrams instead of SE diagrams, because SNL diagrams allow you to compare two songs of different lengths.

There are two distance metrics to choose from: `'b'` for [bottleneck distance](https://en.wikipedia.org/wiki/Topological_data_analysis) and `'w'` for [Wasserstein distance](https://en.wikipedia.org/wiki/Wasserstein_metric).

With a matching truth matrix, you could implement a [k-nearest neighbors (kNN)](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) classifier on top of the distance matrix.

In [6]:
# results can be misleading: SE diagrams assume the songs have the same length
D, labels = get_dist_mat(SE_all, metric='b')

print("Distance matrix:")
print(D)

# note that because our directory contained expanded and nonexpanded
# versions of the same song, there are duplicate names in labels
print("\nLabels:")
print(labels)

Distance matrix:
[[ 0.  36.  36.  36. ]
 [36.   0.  21.5 21.5]
 [36.  21.5  0.  18.5]
 [36.  21.5 18.5  0. ]]

Labels:
['mazurka06-1', 'mazurka06-2', 'mazurka06-1', 'mazurka06-2']
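
As a rough illustration of the kNN idea, the sketch below runs a leave-one-out 1-nearest-neighbor check on the distance matrix and labels printed above. Using the labels as ground truth is an assumption made for this example, and the helper `nearest_neighbor_labels` is hypothetical.

```python
# Distance matrix and labels copied from the output above.
D = [
    [0.0, 36.0, 36.0, 36.0],
    [36.0, 0.0, 21.5, 21.5],
    [36.0, 21.5, 0.0, 18.5],
    [36.0, 21.5, 18.5, 0.0],
]
labels = ['mazurka06-1', 'mazurka06-2', 'mazurka06-1', 'mazurka06-2']

def nearest_neighbor_labels(dist, labels):
    """Leave-one-out 1-NN: label each item by its closest other item."""
    predictions = []
    for i, row in enumerate(dist):
        others = [j for j in range(len(row)) if j != i]
        # min() returns the first minimum, so ties go to the lowest index.
        nearest = min(others, key=lambda j: row[j])
        predictions.append(labels[nearest])
    return predictions

predicted = nearest_neighbor_labels(D, labels)
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
print(predicted)
print(accuracy)
```

On this four-diagram example, no diagram's nearest neighbor shares its label, which is consistent with the caution above about comparing SE diagrams directly.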
