# The Search Module


The `search` module of the `ah` package holds functions used to find and record the diagonals in the thresholded matrix, T. These functions prepare the found diagonals to be transformed and assembled later.

- **find_complete_list**: Finds all smaller diagonals (and the associated pairs of repeats) that are contained in pair_list, which is composed of larger diagonals found in find_initial_repeats.

- **\_\_find_add_srows**: Finds pairs of repeated structures, represented as diagonals of a certain length, k, that start at the same time step as previously found pairs of repeated structures of the same length.

- **\_\_find_add_erows**: Finds pairs of repeated structures, represented as diagonals of a certain length, k, that end at the same time step as previously found pairs of repeated structures of the same length.

- **\_\_find_add_mrows**: Finds pairs of repeated structures, represented as diagonals of a certain length, k, that neither start nor end at the same time steps as previously found pairs of repeated structures of the same length.

- **find_all_repeats**: Finds all the diagonals present in thresh_mat. This function is nearly identical to find_initial_repeats, with two crucial differences. First, we do not remove diagonals after we find them. Second, there is no smallest bandwidth size as we are looking for all diagonals.

- **find_complete_list_anno_only**: Finds annotations for all pairs of repeats found in find_all_repeats. This list contains all the pairs of repeated structures with their start/end indices and lengths.    

The following functions are imported from the [`utilities`](../ah/blob/master/aligned-hierarchies/utilities.py) module to reformat outputs and assist with the operations of the `search` functions.

- stretch_diags
- add_annotations 
- \_\_find_song_pattern

![Flow chart of the function pipeline](function_pipeline.png)

### Importing necessary modules

In [5]:
#used for mathematical calculations 
import numpy as np

#search module
from search import *
from search import __find_add_erows, __find_add_mrows, __find_add_srows

#utilities module 
from utilities import * 
from utilities import __find_song_pattern


## find_complete_list

As seen in the flow chart, `find_initial_repeats` is called by `example` right before `find_complete_list`. `find_complete_list` adds smaller pairs of repeats to the original list of pairs of repeats made in `find_initial_repeats`. Each pair of repeats corresponds to a repeated structure in another NumPy array, `thresh_mat`. This array holds all the repeated structures in a sequential data stream, and the repeated structures are represented as diagonals.

The inputs for the function are:

- pair_list (np.ndarray):  pairs of repeats found in [`find_initial_repeats`](../ah/blob/master/vignettes/utilities_vignette.ipynb)
   
- song_length (int):  the number of audio shingles  

The output for the function is: 

- lst_out (np.ndarray):  pairs of repeats with the added smaller repeats   
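
Judging from the example arrays in this vignette, each row of `pair_list` appears to hold the start and end of the first copy, the start and end of the second copy, and the length of the repeat. The following is a minimal sketch under that assumption; `describe_pair` is a hypothetical helper introduced here for illustration and is not part of the `search` module.

```python
import numpy as np

def describe_pair(row):
    """Unpack one row, assuming the layout [s1, e1, s2, e2, length]."""
    s1, e1, s2, e2, length = (int(v) for v in row[:5])
    return {
        "first": (s1, e1),     # start/end of the first copy
        "second": (s2, e2),    # start/end of the second copy
        "length": length,
        # True when both copies span exactly `length` time steps (inclusive ends)
        "consistent": (e1 - s1 + 1 == length) and (e2 - s2 + 1 == length),
    }

pair = describe_pair(np.array([1, 15, 31, 45, 15]))
print(pair)
```

Here the two copies cover time steps 1 through 15 and 31 through 45, and both spans agree with the stored length of 15.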

In [6]:
# Example 1
# Inputs: 
pair_list = np.array([[1, 15, 31, 45, 15], 
                      [1, 10, 46, 55, 10], 
                      [31, 40, 46, 55, 10],
                      [10, 20, 40, 50, 15]])
song_length = 55

print("The input array is: \n", pair_list)
print("The number of audio shingles is: \n", song_length)

The input array is: 
 [[ 1 15 31 45 15]
 [ 1 10 46 55 10]
 [31 40 46 55 10]
 [10 20 40 50 15]]
The number of audio shingles is: 
 55


In [7]:
output = find_complete_list(pair_list, song_length)

print("The output array is: \n", output)

The output array is: 
 [[ 1 10 46 55 10  1]
 [31 40 46 55 10  1]
 [ 1 15 31 45 15  1]
 [10 20 40 50 15  2]]


## \_\_find_add_srows

Finds pairs of repeated structures, represented as diagonals of a certain length, k, that start at the same time step as previously found pairs of repeated structures of the same length.

The inputs for the function are: 

- lst_no_anno (np.ndarray): pairs of repeats      
- check_inds (np.ndarray): list of starting indices of repeats to check
- k (int): length of repeat that we are looking for

The output for the function is: 
- add_rows (np.ndarray): newly found pairs of repeats of length k 


In [8]:
lst_no_anno = np.array([[ 1, 15, 31, 45, 15],
                        [ 1, 10, 46, 55, 10],
                        [31, 40, 46, 55, 10],
                        [10, 20, 40, 50, 15]])
check_inds = np.array([ 1, 31, 46])
k = 10

print("The input array is: \n", lst_no_anno)
print("The indices of repeats to check: \n", check_inds)
print("The length of repeat we are searching for: \n", k)

The input array is: 
 [[ 1 15 31 45 15]
 [ 1 10 46 55 10]
 [31 40 46 55 10]
 [10 20 40 50 15]]
The indices of repeats to check: 
 [ 1 31 46]
The length of repeat we are searching for: 
 10


In [10]:
output = __find_add_srows(lst_no_anno, check_inds, k)

print("The output array is: \n", output)

The output array is: 
 [[ 1 10 31 40 10]
 [11 15 41 45  5]
 [ 1 10 31 40 10]
 [11 15 41 45  5]]
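
The rows printed above are consistent with a simple rule: take each repeat longer than `k` whose first or second copy starts at a checked index, cut off its first `k` time steps as a new pair, and keep the remainder as a second pair. The sketch below implements that rule as an illustration only. `sketch_add_srows` is a hypothetical name, and the logic is an assumption inferred from the example, not the package's implementation.

```python
import numpy as np

def sketch_add_srows(lst_no_anno, check_inds, k):
    """Split longer repeats that start at a checked index (illustrative only)."""
    rows = []
    # Only repeats strictly longer than k can contain a shorter repeat
    longer = lst_no_anno[lst_no_anno[:, 4] > k]
    for ind in check_inds:
        for col in (0, 2):  # column 0: start of first copy, column 2: start of second
            for s1, e1, s2, e2, length in longer[longer[:, col] == ind]:
                rows.append([s1, s1 + k - 1, s2, s2 + k - 1, k])    # first k steps
                rows.append([s1 + k, e1, s2 + k, e2, length - k])   # remainder
    return np.array(rows, dtype=int).reshape(-1, 5)

lst_no_anno = np.array([[ 1, 15, 31, 45, 15],
                        [ 1, 10, 46, 55, 10],
                        [31, 40, 46, 55, 10],
                        [10, 20, 40, 50, 15]])
check_inds = np.array([1, 31, 46])

add_rows = sketch_add_srows(lst_no_anno, check_inds, 10)
print(add_rows)
```

Under this rule the same two rows appear twice because the repeat `[1, 15, 31, 45, 15]` matches once through its first start (1) and once through its second start (31).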


## \_\_find_add_erows

Finds pairs of repeated structures, represented as diagonals of a certain length, k, that end at the same time step as previously found pairs of repeated structures of the same length.

The inputs for the function are: 

- lst_no_anno (np.ndarray): pairs of repeats
- check_inds (np.ndarray): list of ending indices of repeats
- k (int): length of repeats that we are looking for 

The output for the function is: 
- add_rows (np.ndarray): newly found pairs of repeats of length k 

In [11]:
lst_no_anno = np.array([[ 1, 15, 31, 45, 15],
                        [ 1, 10, 46, 55, 10],
                        [31, 40, 46, 55, 10],
                        [10, 20, 40, 50, 15]])
check_inds = np.array([10, 40, 55])
k = 10

print("The input array is: \n", lst_no_anno)
print("The indices of repeats to check: \n", check_inds)
print("The length of repeat we are searching for: \n", k)

The input array is: 
 [[ 1 15 31 45 15]
 [ 1 10 46 55 10]
 [31 40 46 55 10]
 [10 20 40 50 15]]
The indices of repeats to check: 
 [10 40 55]
The length of repeat we are searching for: 
 10


In [12]:
output = __find_add_erows(lst_no_anno, check_inds, k)

print("The output array is: \n", output)

The output array is: 
 []
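
One plausible reading of the empty result above is that only repeats longer than `k` are candidates for splitting, and none of those ends at a checked index. The sketch below checks that reading and shows what an end-aligned split could look like. `split_at_end` is a hypothetical helper, and the splitting rule is an assumption made for illustration rather than the package's implementation.

```python
import numpy as np

def split_at_end(row, k):
    """Split a repeat so that its last k steps form their own pair (illustrative only)."""
    s1, e1, s2, e2, length = (int(v) for v in row)
    if length <= k:
        raise ValueError("the repeat must be longer than k")
    head = [s1, e1 - k, s2, e2 - k, length - k]   # everything before the last k steps
    tail = [e1 - k + 1, e1, e2 - k + 1, e2, k]    # the last k steps
    return np.array([head, tail])

lst_no_anno = np.array([[ 1, 15, 31, 45, 15],
                        [ 1, 10, 46, 55, 10],
                        [31, 40, 46, 55, 10],
                        [10, 20, 40, 50, 15]])
check_inds = np.array([10, 40, 55])
k = 10

# Repeats longer than k, and whether either copy ends at a checked index
longer = lst_no_anno[lst_no_anno[:, 4] > k]
ends_checked = np.isin(longer[:, 1], check_inds) | np.isin(longer[:, 3], check_inds)
print(ends_checked)

# What the split would produce for a repeat that did qualify
pieces = split_at_end(lst_no_anno[0], k)
print(pieces)
```

The two longer repeats end at 15/45 and 20/50, so neither matches `check_inds`, which agrees with the empty output of the example.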


## \_\_find_add_mrows

Finds pairs of repeated structures, represented as diagonals of a certain
length that neither start nor end at the same time steps as previously
found pairs of repeated structures of the same length. 

The inputs for the function are: 

- lst_no_anno (np.ndarray): pairs of repeats
- check_inds (np.ndarray): list of starting indices of repeats to check
- k (int): length of repeats that we are looking for 

The output for the function is: 
- add_rows (np.ndarray): newly found pairs of repeats of length k 

In [13]:
lst_no_anno = np.array([[ 1, 15, 31, 45, 15],
                        [ 1, 10, 46, 55, 10],
                        [31, 40, 46, 55, 10],
                        [10, 20, 40, 50, 15]])
check_inds = np.array([ 1, 31, 46])
k = 10

print("The input array is: \n", lst_no_anno)
print("The indices of repeats to check: \n", check_inds)
print("The length of repeat we are searching for: \n", k)

The input array is: 
 [[ 1 15 31 45 15]
 [ 1 10 46 55 10]
 [31 40 46 55 10]
 [10 20 40 50 15]]
The indices of repeats to check: 
 [ 1 31 46]
The length of repeat we are searching for: 
 10


In [14]:
output = __find_add_mrows(lst_no_anno, check_inds, k)

print("The output array is: \n", output)

The output array is: 
 []
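
To illustrate what a "middle" repeat is, the sketch below cuts a length-`k` piece out of the interior of a longer repeat, which leaves a left piece, the middle piece, and a right piece. `split_in_middle` is a hypothetical helper that only handles a start index given relative to the first copy, and the start index 4 is chosen for illustration. It is a sketch of the idea under these assumptions, not the package's implementation.

```python
import numpy as np

def split_in_middle(row, start, k):
    """Cut a length-k piece beginning at `start` out of the first copy (illustrative only)."""
    s1, e1, s2, e2, length = (int(v) for v in row)
    offset = start - s1
    # The piece may neither start nor end where the longer repeat does
    if offset <= 0 or start + k - 1 >= e1:
        raise ValueError("the piece must lie strictly inside the repeat")
    left = [s1, start - 1, s2, s2 + offset - 1, offset]
    middle = [start, start + k - 1, s2 + offset, s2 + offset + k - 1, k]
    right = [start + k, e1, s2 + offset + k, e2, length - offset - k]
    return np.array([left, middle, right])

pieces = split_in_middle([1, 15, 31, 45, 15], start=4, k=10)
print(pieces)
```

In the vignette's example, none of the checked indices 1, 31, or 46 lies strictly inside a longer repeat with room for 10 time steps, which is consistent with the empty output.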


##  find_all_repeats 

Finds all the diagonals present in thresh_mat. This function is nearly identical to find_initial_repeats, with two crucial differences. First, we do not remove diagonals after we find them. Second, there is no smallest bandwidth size as we are looking for all diagonals.

The inputs for the function are: 

- thresh_mat (np.ndarray): thresholded matrix that we extract diagonals from
- band_width_vec (np.ndarray): vector of lengths of diagonals to be found

The output for the function is:

- all_lst (np.ndarray): pairs of repeats that correspond to diagonals in thresh_mat
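
To make "finding diagonals of a given length" concrete, here is a naive brute-force sketch that scans the part of a thresholded matrix above the main diagonal for runs of consecutive ones. `naive_find_diagonals` is a hypothetical name, and the 1-based `[s1, e1, s2, e2, length]` row layout is an assumption based on the other examples in this vignette. The package itself may use a different and faster approach.

```python
import numpy as np

def naive_find_diagonals(thresh_mat, bw):
    """Brute-force search for runs of `bw` ones along diagonals above the main diagonal."""
    n = thresh_mat.shape[0]
    steps = np.arange(bw)
    found = []
    for i in range(n - bw + 1):
        for j in range(i + 1, n - bw + 1):
            # Entries (i, j), (i+1, j+1), ..., (i+bw-1, j+bw-1) form one diagonal run
            if np.all(thresh_mat[i + steps, j + steps] == 1):
                # Report 1-based, inclusive indices: [s1, e1, s2, e2, length]
                found.append([i + 1, i + bw, j + 1, j + bw, bw])
    return np.array(found, dtype=int).reshape(-1, 5)

thresh_mat = np.array([[0, 0, 0, 0, 0],
                       [0, 1, 0, 1, 0],
                       [0, 0, 0, 0, 0],
                       [0, 1, 0, 1, 0],
                       [0, 0, 0, 0, 0]])

# Found diagonals are not removed, so every length is searched on the same matrix
all_pairs = np.vstack([naive_find_diagonals(thresh_mat, bw) for bw in [1, 2, 3, 4, 5]])
print(all_pairs)
```

For this matrix the only off-diagonal one above the main diagonal sits in row 2, column 4 (1-based), so the sketch reports a single length-1 pair.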

In [15]:
thresh_mat = np.array([[0, 0, 0, 0, 0],
                       [0, 1, 0, 1, 0],
                       [0, 0, 0, 0, 0],
                       [0, 1, 0, 1, 0],
                       [0, 0, 0, 0, 0]])

# 1-D vector of diagonal lengths (a 2-D array here raises
# "TypeError: only integer scalar arrays can be converted to a scalar index")
bandwidth_vec = np.array([1, 2, 3, 4, 5])

print("The threshold matrix is: \n", thresh_mat)
print("The lengths of the diagonals to be found are: \n", bandwidth_vec)

The threshold matrix is: 
 [[0 0 0 0 0]
 [0 1 0 1 0]
 [0 0 0 0 0]
 [0 1 0 1 0]
 [0 0 0 0 0]]
The lengths of the diagonals to be found are: 
 [1 2 3 4 5]


In [16]:
output = find_all_repeats(thresh_mat, bandwidth_vec)

print("The output array is: \n", output)

## find_complete_list_anno_only

Finds annotations for all pairs of repeats found in `find_all_repeats`. This list contains all the pairs of repeated structures with their start/end indices and lengths.

The inputs for the function are: 

- pair_list (np.ndarray): pairs of repeats 
- song_length (int): number of audio shingles

The output for the function is: 

- out_lst (np.ndarray): pairs of repeats with added smaller repeats and annotations 

In [17]:
pair_list = np.array([[2, 2, 4, 4, 1]])
song_length = 5 

print("The pairs of repeats are: \n", pair_list)
print("The number of audio shingles in the song is: \n", song_length)

The pairs of repeats are: 
 [[2 2 4 4 1]]
The number of audio shingles in the song is: 
 5


In [19]:
output = find_complete_list_anno_only(pair_list, song_length)

print("The output array is: \n", output)

The output array is: 
 [[2 2 4 4 1 1]]
