First, convert the audio into chroma vectors, then compute the Self-Similarity Matrix (SSM) of the chroma features using cosine similarity.
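
As a rough illustration of this step, the sketch below computes a cosine-similarity SSM with NumPy. The function name `cosine_ssm` and the toy chroma input are assumptions made for the example. In practice the chroma matrix would come from a feature extractor (for instance `librosa.feature.chroma_stft`, which may or may not be the one used in this project).

```python
import numpy as np

def cosine_ssm(chroma):
    """Cosine self-similarity of chroma frames.

    chroma: array of shape (12, n_frames), one column per frame.
    Returns an (n_frames, n_frames) matrix.
    """
    chroma = np.asarray(chroma, dtype=float)
    norms = np.linalg.norm(chroma, axis=0, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero on silent frames
    unit = chroma / norms    # normalize every frame to unit length
    return unit.T @ unit     # dot products of unit vectors = cosine similarity

# Toy input (hypothetical): 4 frames of one chord, then 4 frames of another.
chroma = np.zeros((12, 8))
chroma[[0, 4, 7], :4] = 1.0  # pitch classes C, E, G
chroma[[2, 5, 9], 4:] = 1.0  # pitch classes D, F, A

ssm = cosine_ssm(chroma)
print(np.round(ssm, 2))
```

With this toy input the SSM shows two bright square blocks on the diagonal, which is the block structure the next step looks for.
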
The following picture shows the SSM of the chroma features.

Next, we need to detect the boundary lines between adjacent blocks. These boundaries are called novelty. To find the novelty, we slide a simple kernel across the SSM from bottom-left to top-right.
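
The kernel itself is not specified here, so the following is only a sketch under the assumption that it is a checkerboard kernel moved along the main diagonal of the SSM. The names `checkerboard_kernel`, `novelty_curve`, and `half_size` are hypothetical, and the toy SSM is built by hand.

```python
import numpy as np

def checkerboard_kernel(half_size):
    """+1 on the two within-block quadrants, -1 on the two cross-block quadrants."""
    sign = np.ones(2 * half_size)
    sign[:half_size] = -1.0
    return np.outer(sign, sign)

def novelty_curve(ssm, half_size):
    """Slide the kernel along the main diagonal of the SSM."""
    n = ssm.shape[0]
    kernel = checkerboard_kernel(half_size)
    # Zero-pad so the kernel can also be centered near the matrix edges.
    padded = np.pad(ssm, half_size, mode="constant")
    novelty = np.zeros(n)
    for i in range(n):
        # Window covers frames i - half_size .. i + half_size - 1.
        window = padded[i:i + 2 * half_size, i:i + 2 * half_size]
        novelty[i] = np.sum(window * kernel)
    return novelty

# Toy SSM (assumed): two homogeneous blocks of 4 frames each.
ssm = np.kron(np.eye(2), np.ones((4, 4)))

novelty = novelty_curve(ssm, half_size=2)
print(novelty)
```

The curve is highest at frame 4, where the first block ends and the second begins. The smaller values at the two ends come from the zero padding, not from a real boundary.
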
Then, we plot the kernel output as the novelty curve. The following picture shows the novelty curve.

Finally, we post-process the novelty curve with the following rules.
- Min distance between peaks: 2 sec
- Discard peaks with low novelty (the lowest 15%)
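
The two rules above can be sketched as a small peak picker. This is only one possible reading: the 15% rule is interpreted here as "below 15% of the maximum novelty value" (it could also mean the lowest 15% of peaks by rank). The names `pick_peaks` and `frame_rate`, and the toy curve, are assumptions for the example.

```python
import numpy as np

def pick_peaks(novelty, frame_rate, min_distance_sec=2.0, low_fraction=0.15):
    """Return frame indices of peaks that survive both post-processing rules."""
    novelty = np.asarray(novelty, dtype=float)
    if novelty.size == 0:
        return []
    # Candidate peaks: local maxima of the curve.
    candidates = [i for i in range(1, len(novelty) - 1)
                  if novelty[i] > novelty[i - 1] and novelty[i] >= novelty[i + 1]]
    # Rule: discard low-novelty peaks (assumed: below 15% of the maximum).
    threshold = low_fraction * novelty.max()
    candidates = [i for i in candidates if novelty[i] >= threshold]
    # Rule: minimum distance between peaks, strongest peaks win.
    min_gap = int(round(min_distance_sec * frame_rate))
    kept = []
    for i in sorted(candidates, key=lambda k: novelty[k], reverse=True):
        if all(abs(i - j) >= min_gap for j in kept):
            kept.append(i)
    return sorted(kept)

# Toy novelty curve (hypothetical), sampled at 2 frames per second.
frame_rate = 2.0
novelty = [0, 1.0, 0, 0.8, 0, 0, 0.1, 0, 0, 0, 0.9, 0]

peaks = pick_peaks(novelty, frame_rate)
print(peaks)                            # frame indices
print([p / frame_rate for p in peaks])  # positions in seconds
```

In this toy curve the peak at frame 3 is dropped because it lies within 2 seconds of the stronger peak at frame 1, and the peak at frame 6 is dropped because its novelty is too low.
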

https://github.com/CodeGoood/Music-Segmentation/blob/master/pic/output.m4a
The sound played at each peak in the audio marks a predicted phrase position.

- A phrase may consist of multiple chords
- A phrase may begin/end within a chord
- More information than chroma is needed (e.g. vocal onset/offset)