# The Run_SNLD Module

The `run_SNLD` module of the `repytah-se` package constructs and analyzes Start Normalized Length (SNL) diagrams. SNL diagrams are built from Aligned Hierarchies (AH) and are based on Topological Data Analysis. The module can process either a single file or a directory that holds multiple files.

For the cover song detection task, SNL diagrams are better suited to comparing songs than either Aligned Hierarchies, whose strength lies in visualization, or SE diagrams, because SNL diagrams can compare songs of different lengths.



## Pipeline

The overall pipeline for `repytah-se` is shown below.

<img src="pictures/SNLflow.jpg" width="380">



To create a distance matrix from SE/SNL diagrams, it is recommended to provide multiple files.

You can start from chroma vector files or, if you have already computed them, from aligned hierarchies. You can also choose between processing a single file into an SE/SNL diagram and processing a directory of multiple files into SE/SNL diagrams and distance matrices. The examples in the following sections cover these options.

## Import Modules


In [1]:
import numpy as np
import pandas as pd
import persim
import pkg_resources

from run_SNLD import *


## Processing a Single File

This phase integrates `repytah` with `repytah-se`. We create aligned hierarchies from the chroma vectors of a music-based data stream (e.g., a song) stored in `.csv` files. In this step, we choose a threshold value to filter out noise and a number of feature vectors per shingle for contextualization. If you run multiple tests with the same threshold and the same number of feature vectors per shingle, it is recommended to save the intermediate aligned hierarchies. Please note that while `repytah` contains methods to save aligned hierarchies as `.csv` or `.mat` files, this module saves the intermediate aligned hierarchies in `.mat` format.

For more on how aligned hierarchies are created, see `example_vignette.ipynb`.
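To make the "feature vectors per shingle" parameter concrete, here is a minimal sketch of shingling in plain Python. It is a simplified illustration and not `repytah`'s implementation: the helper name `make_shingles` and the toy data are assumptions introduced only for this example. Each shingle concatenates a run of consecutive feature vectors, so a larger `num_fv_per_shingle` gives every time step more surrounding context.

```python
def make_shingles(feature_vectors, num_fv_per_shingle):
    """Concatenate each run of consecutive feature vectors into one shingle.

    feature_vectors: list of equal-length lists, one per time step.
    Returns n - k + 1 shingles, each of length k * d, where n is the number
    of time steps, k is num_fv_per_shingle, and d is the feature dimension.
    (Hypothetical helper for illustration; not part of repytah.)
    """
    n = len(feature_vectors)
    k = num_fv_per_shingle
    if k < 1 or k > n:
        raise ValueError("num_fv_per_shingle must be between 1 and the number of feature vectors")
    shingles = []
    for start in range(n - k + 1):
        shingle = []
        for vec in feature_vectors[start:start + k]:
            shingle.extend(vec)
        shingles.append(shingle)
    return shingles


# Toy input: 5 time steps of 3-dimensional features
# (real chroma vectors have 12 pitch-class bins).
toy = [[float(t), float(t) + 0.5, float(t) + 1.0] for t in range(5)]
shingles = make_shingles(toy, num_fv_per_shingle=2)
print(len(shingles), len(shingles[0]))  # prints: 4 6
```

With 5 time steps and 2 feature vectors per shingle, the sketch yields 4 shingles of length 6.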

We are using Chopin's Mazurka Op.6, No.1 as input for this demonstration.

In [2]:
# starting from chroma vectors

filepath = 'data/input.csv'
num_fv_per_shingle = 12
thresh = 0.02
norm_type = "std"
alpha = 1

# default for isChroma is True
SNL_diagram = get_SNLDs(filepath, num_fv_per_shingle, thresh, norm_type, alpha, isChroma=True, save=False)

print(len(SNL_diagram))


157


In [3]:
# starting from AH

filepath = 'data/input.mat'
norm_type = "std"
alpha = 1

# these values don't matter since starting from AHs
num_fv_per_shingle = 12
thresh = 0.02

SNL_diagram = get_SNLDs(filepath, num_fv_per_shingle, thresh, norm_type, alpha, isChroma=False, save=False)

print(len(SNL_diagram))




157


## Processing Multiple Files

This module is based on Chopin's Mazurka Dataset, which has expanded and nonexpanded forms for each piece, so the data directories are set up as shown below. For the module to run smoothly out of the box, it is recommended that you organize your files the same way, adapted to your purposes.

Before running, the directory should look like this:
<img src="pictures/begin_directoryascii.png" width="300">

If saving intermediate Aligned Hierarchies, the directory will look like this:
<img src="pictures/after_directoryascii.png" width="380">
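As a rough illustration of this layout, the sketch below builds a throwaway directory and reads back its class names and file labels. It assumes one subdirectory per class (here `expanded` and `nonexpanded`), each holding one `.csv` file per piece. The helper `collect_classes` is hypothetical and is not part of `run_SNLD`; it only mirrors the idea that subdirectory names act as classes and file stems act as labels.

```python
import tempfile
from pathlib import Path


def collect_classes(dir_path, suffix=".csv"):
    """Return (class_names, labels) for a directory with one subfolder per class.

    Hypothetical helper for illustration; not part of run_SNLD.
    """
    root = Path(dir_path)
    class_names, labels = [], []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(class_dir.glob("*" + suffix))
        class_names.append(class_dir.name)
        labels.append([f.stem for f in files])
    return class_names, labels


# Build a temporary directory that follows the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    for class_name in ("expanded", "nonexpanded"):
        class_dir = Path(tmp) / class_name
        class_dir.mkdir()
        for piece in ("mazurka06-1", "mazurka06-2"):
            (class_dir / (piece + ".csv")).touch()
    class_names, labels = collect_classes(tmp)

print(class_names)
print(labels)
```

Running a check like this on your own data directory before calling the module can catch misplaced files early.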

In [4]:
dir_path = 'data/chroma_vectors'  # can also start from AHs

num_fv_per_shingle = 12
thresh = 0.02
norm_type = "std"
alpha = 1

SNLD_all = get_SNLD_directory(dir_path, num_fv_per_shingle, thresh, norm_type, alpha, isChroma=True, save=False)
# count diagrams across all classes (classes need not be the same size)
num_diagrams = sum(len(class_SNLDs) for class_SNLDs in SNLD_all.SNLDs)
print("There are {} total SE diagrams in {}".format(num_diagrams, dir_path))
print("The classes are:", SNLD_all.className)
print("The labels for each SE diagram are:", SNLD_all.labels)


There are 4 total SE diagrams in data/chroma_vectors
The classes are: ['\\expanded', '\\nonexpanded']
The labels for each SE diagram are: [['mazurka06-1', 'mazurka06-2'], ['mazurka06-1', 'mazurka06-2']]


## get_dist_mat

This function compares songs pairwise. It is recommended to run it on SNL diagrams rather than SE diagrams, because SNL diagrams let you compare two songs of different lengths.

There are two distance metrics to choose from: `'b'` for [bottleneck distance](https://en.wikipedia.org/wiki/Topological_data_analysis) and `'w'` for [Wasserstein distance](https://en.wikipedia.org/wiki/Wasserstein_metric).
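For orientation, the standard textbook definitions of the two metrics are given below for two diagrams $X$ and $Y$, viewed as finite multisets of points in the plane. They assume $\gamma$ ranges over bijections between $X$ and $Y$ after each diagram has been padded so that the two have equal size (for persistence diagrams, padding uses points on the diagonal). The choice of norm, the exponent $p$, and the handling of unmatched points inside `get_dist_mat` are not specified here, so treat these formulas as background rather than as the module's exact computation.

$$
d_B(X, Y) = \inf_{\gamma : X \to Y} \; \sup_{x \in X} \lVert x - \gamma(x) \rVert_\infty
$$

$$
W_p(X, Y) = \left( \inf_{\gamma : X \to Y} \; \sum_{x \in X} \lVert x - \gamma(x) \rVert^{p} \right)^{1/p}
$$

The bottleneck distance is determined by the single worst-matched pair of points, while the Wasserstein distance accumulates the cost of every matched pair.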

With a matching truth matrix, you could implement a [kNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) classifier.

In [5]:
D, labels = get_dist_mat(SNLD_all, metric='w')

print("Distance matrix:")
print(D)

# note that because our directory contained expanded and nonexpanded
# versions of the same song, there are duplicate names in labels
print("\nLabels:")
print(labels)

Distance matrix:
[[   0.         2761.42577757 2560.66138326 3040.75517384]
 [2761.42577757    0.          256.85147665  297.78114228]
 [2560.66138326  256.85147665    0.          487.35995357]
 [3040.75517384  297.78114228  487.35995357    0.        ]]

Labels:
['mazurka06-1', 'mazurka06-2', 'mazurka06-1', 'mazurka06-2']
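
The kNN idea mentioned in this section can be sketched with the distance matrix and labels printed above. This is a minimal 1-nearest-neighbor example in plain Python, not a function provided by `run_SNLD`. The truth matrix construction (two diagrams match when their labels are equal) and the helper `nearest_neighbor` are assumptions made for illustration.

```python
# Distance matrix and labels copied from the output above.
D = [
    [0.0, 2761.42577757, 2560.66138326, 3040.75517384],
    [2761.42577757, 0.0, 256.85147665, 297.78114228],
    [2560.66138326, 256.85147665, 0.0, 487.35995357],
    [3040.75517384, 297.78114228, 487.35995357, 0.0],
]
labels = ["mazurka06-1", "mazurka06-2", "mazurka06-1", "mazurka06-2"]

# Assumed truth matrix: entry (i, j) is True when items i and j share a label.
truth = [[a == b for b in labels] for a in labels]


def nearest_neighbor(D, i):
    """Index of the closest item to i, excluding i itself (hypothetical helper)."""
    candidates = [j for j in range(len(D)) if j != i]
    return min(candidates, key=lambda j: D[i][j])


matches = [nearest_neighbor(D, i) for i in range(len(D))]
hits = [truth[i][j] for i, j in enumerate(matches)]
accuracy = sum(hits) / len(hits)

print(matches)   # prints: [2, 2, 1, 1]
print(accuracy)  # prints: 0.5
```

On this four-item example, two of the four nearest neighbors share a label with their query. A sample this small only demonstrates the mechanics and says nothing about retrieval quality in general.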
