motif_diss.tex



\documentclass[11pt]{article}
\usepackage{amsmath,cite}
\usepackage{graphicx}
\usepackage{courier}
\usepackage{authblk} 
\usepackage[ margin=1.8in]{geometry}
\usepackage{listings}
\usepackage{color}
\usepackage{verbatim}
\usepackage{fancyhdr}
 \usepackage{hyperref}
 
 \usepackage{xcolor}
\hypersetup{
  colorlinks   = true, %Colours links instead of ugly boxes
  urlcolor     = red, %Colour for external hyperlinks
  linkcolor    = blue, %Colour of internal links
  citecolor   = blue %Colour of citations
}
 
\pagestyle{fancy}
\lhead{MOTIF DISCOVERY}
\rhead{ }

\newcommand{\figref}[1]{\figurename~\ref{#1}}
%----------------------------------------------------------------------------
%----------------------------------------------------------------------------
\renewcommand\Authfont{\small}
\renewcommand\Affilfont{\itshape\footnotesize}
%----------------------------------------------------------------------------
\renewcommand{\baselinestretch}{1.1}
 \renewcommand{\familydefault}{\rmdefault}
%----------------------------------------------------------------------------

%title
% ------
\title{Speech Prosody F0 Time-series Motif Discovery in the
CMN corpus}

\author{Shuo Zhang}
\affil{LING/COSC, Georgetown University\\Music Technology Group, Universitat Pompeu Fabra}
\date{ }


\begin{document}
\maketitle


\section{Premise}
This report describes the motif discovery of tone time-series data in the CMN corpus. Part of it is like a documentation in addition to describing the data, procedure, experimental setup, and analysis. 

\section{Data Collection and Extraction}
\subsection{Corpus}
We are using the CMN corpus. Mandarin Chinese Phonetic Segmentation and Tone (MCPST) corpus
was developed by the Linguistic Data Consortium (LDC) and contains 7,849 Mandarin Chinese ”utterances” and their phonetic segmentation and tone labels separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA. Utterances were considered as the time-stamped between-pause units in the
transcribed news recordings. Those with background noise, music, unidentified speakers and accented speakers were excluded. A test set was developed with 300 utterances randomly selected from six speakers (50 utterances for each speaker). The remaining 7,549 utterances formed a training set. This data set is unique in that it contains annotations on the segmentation and
identity of syllabic tones (whereas other newscast data sets contain only audio and transcripts). The utterances in the test set were manually labeled and segmented into initials and finals in romanized pinyin. Tones were marked on the finals, including Tone1 through Tone4, and Tone0 for the neutral tone. The Sandhi Tone3 was labeled as Tone2. The training set was automatically segmented and transcribed using the LDC forced aligner, which is a Hidden Markov Model (HMM) aligner trained on the same utterances. The aligner achieved 93.1\% agreement (of phone boundaries) within 20 ms on the test set compared to manual segmentation. The quality of the phonetic transcription and tone labels of the training set was evaluated by checking 100 utterances randomly selected from it. The 100 utterances contained 1,252 syllables: 15 syllables had mistaken tone transcriptions; two syllables showedmistaken transcriptions of the final, and there were no syllables with transcription errors on the initial.Each utterance has three associated files: a flac compressed wav file, a word transcript file, and a phonetic boundaries and label file.

\subsection {Data Extraction}
%all the different versions of unigram, bigram, trigram, ...
The preprocessing report discusses the speech processing steps in details. 

\subsection{Running MK database version: Documentation}
The MK\_DB Windows executable returns a pair of top motif each run. It also reports as a byproduct a series of motif pairs involving the reference subsequence along the way. The way we want to run this is (for a particular data file, e.g., bigram 300-point) to run it a few time (10 times), and we extract all motif pairs from its output, sorted by the distance. Typically we get about less than 300 motif pairs from 10 iterations. We have written a series of python scripts to automate this task, first \texttt{run\_MK\_db.py} to generate a \texttt{\_tuple} file for each data file, recording the dist, ts1, ts2 as a 3-tuple of each motif pair. 

Then we have a few ways to visualize or inspect the motif pairs found. The first way is \texttt{plot\_motifs.py}, where we generate a multi-page pdf file that shows plots of all pairs of motifs in a tuple file along with its basic metadata attributes. From these attributes you can also look up their more detailed information using their index and file name identifier from the original data. A series of data, dist, and attributes objects is pickled for each tuple file so that you can look up a motif pair fairly quickly in a python interpreter env. This method of visualization gives you an overview of all motif pairs discovered from a data set and their distances. One observation is that in the unigram 100 point data set, the top motif pairs that have the smallest distances also tend to be not very interesting - in fact they tend to be very linear (straight lines) without any contours. Down the distance-sorted list   we will see some more interesting patterns in the middle range. In the lower ranked pairs we actually will see some examples of motif pairs that are quite dissimilar - therefore the distance pruning should be necessary as discussed in the next paragraph. However, one shortcoming of this method is that we cannot manually inspect 200+ motif pairs and inspect their metadata or audio. In addition we cannot capture the cluster structure of all the subsequences (which are not all unique) in the motif pairs. Therefore we will move forward to our second way of analysis. 

The second way is to build a graph and perform some graph/network analysis. Recall that in the results from MK\_DB, there are some time series subsequences that have occurred multiple times since they serve as the reference points. We therefore use the \texttt{networkx} library in Python to perform a graph visualization and analysis. First we build a graph by reading edgelist file (these files are converted from \texttt{\_tuple.txt} files and are named \texttt{\_tuple\_graph.txt} by convention. Then we can visualize the graph and see the structure of a few centers (reference motifs) and other connected components (example Figure \ref{fig:graphex}). Figure \ref{fig:degreeuni} shows the distribution of degrees of the motif nodes in the unigram 100 point graph. In Figure \ref{fig:inspectuni} we inspect motif number 13436 as the reference motif and its cluster. We also listed the metadata attributes associated with each subsequence so we can trace them back to the original data. With this unigram motif it is not that interesting but we are seeing most tone labels are consistent in this cluster. Our aim is to discover more interesting patterns and clusters in larger units of N-grams.

%%%%%%%%%%%%%%%%%%%%%%% F I G U R E %%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\small
 \centerline{
 \includegraphics[scale=0.5]{graph_example.png}}
 \caption{Unigram 100-point motif clusters graph}
 \label{fig:graphex}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%% F I G U R E %%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\small
 \centerline{
 \includegraphics[scale=0.5]{unigram_dg.png}}
 \caption{Unigram 100-point motif graph degree distribution}
 \label{fig:degreeuni}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%% F I G U R E %%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\small
 \centerline{
 \includegraphics[scale=0.5]{inspect_uni.png}}
 \caption{Inspect a unigram 100-point motif (13436)}
 \label{fig:inspectuni}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

One thing we should note before we move on to the analysis of N-grams graph analysis is that in the old version of MK db algorithm, only the top motif pair is returned and others are along the way as a byproduct. But in the meantime we can think of running this many iterations as a pruning motif clusters by distance (as we sort tuples by distance, we are effectively applying a cut-off at the largest distance). But some distances are probably too large that they should not be included in this cluster. On the other hand we also note that this cluster is somewhat arbitary and not exhaustive, since the reference points are chosen by some heuristics and not intended to return all similar subsequences within R range as a \textit{range motif}, or the top k motif as defined in the MK paper. 


\section{MK Motif discovery: Database version}
We first run the MK algorithm for exact discovery of time-series motifs (\cite{Mueen2009}). 

\subsection{Unigram motif discovery}

%100 point, ...
%graph analysis

\subsection {Bigram motif discovery}


\subsection {Trigram motif discovery}


\section{MK motif discovery: subsequence version}

\subsection{MK motif discovery: top K motif clusters}


\section{Time-series motifs Matrix Profile}


\bibliographystyle{alpha} 
\bibliography{shuo}
%\bibliographystyle{acm}

\end{document}