## Speech Synthesis (Holmes & Holmes, 2001)
- Message synthesis from stored waveforms is a long-established technique for providing a limited range of spoken information.  The simplest systems join together word-size units.  The technical quality of the speech can be high, but it is not possible to produce good results for a wide range of message types.
- Synthesis from diphones gives complete flexibility of message content, but is limited by the difficulty of making the diphones represent all the articulation effects that occur in different phonetic environments.  To obtain the best quality from diphone synthesis, care must be taken in selecting and excising the examples.  Quality may be improved by adding to the simple diphone set to include allophone-based units or longer units spanning several phones.
- Vocoders require a small fraction of the memory needed for simple waveform storage, and also make it easy to vary the pitch and timing, and to smooth the joins between any units being concatenated.  Synthesis quality is, however, limited by the inherent vocoder quality.
- The technique known as pitch-synchronous overlap-add (PSOLA) allows good synthesis quality to be achieved by concatenating short waveform segments.  Smooth joins are obtained by concatenating segments pitch-synchronously and overlapping the end of one segment with the start of the next.
- By decomposing the speech signal into pitch periods with overlapping windows, prosodic modifications are easy with PSOLA.  Timing can be modified by repeating or removing individual pitch periods, and pitch can be changed by altering the spacing between windows before resynthesis.
- Time-domain PSOLA is simple to implement, but needs a lot of memory and cannot smooth any spectral discontinuities occurring at segment boundaries.
- Other variants of PSOLA address the above limitations by incorporating some parametric representation of the speech, such as LPC or MBE coding, while retaining the PSOLA technique for prosodic modifications.
- The hardware cost for synthesis from stored human speech is dominated by the memory requirements except for multi-channel systems.
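
The time-domain PSOLA idea summarised above can be sketched in a few lines. This is a toy illustration only: it assumes a constant pitch period and pitch marks that are already known, and the function name `td_psola` and its parameters are hypothetical rather than taken from the source.

```python
import numpy as np

def td_psola(signal, marks, period, pitch_factor=1.0):
    """Toy TD-PSOLA for a signal with a constant pitch period (in samples).

    marks: array of analysis pitch-mark indices, spaced `period` apart,
           with at least `period` samples of signal on either side.
    pitch_factor > 1 raises the pitch by placing the windows closer together.
    """
    window = np.hanning(2 * period + 1)       # two-period window centred on a mark
    new_period = int(round(period / pitch_factor))
    out = np.zeros(len(signal))
    t = marks[0]                              # first synthesis mark
    while t <= marks[-1]:
        # Reuse the analysis segment nearest to this synthesis time; this
        # repeats or drops pitch periods so the overall duration is kept.
        m = marks[np.argmin(np.abs(marks - t))]
        segment = signal[m - period:m + period + 1] * window
        out[t - period:t + period + 1] += segment   # overlap-add
        t += new_period
    return out
```

With `pitch_factor=1.0` the overlapping Hann windows sum to one between the first and last marks, so the signal is reconstructed unchanged there. Other factors change the spacing of the windows while each window keeps its original shape.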

### Synthesis by Rule
- Phonetic synthesis by rule involves applying acoustic-phonetic rules to generate synthesizer control parameters from a description of an utterance in terms of a sequence of phonetic segments together with prosodic information.
- Most acoustic-phonetic rule systems are designed for a formant synthesizer.
- A convenient implementation is to store the rules as tables of numbers for use by a single computational procedure.
- Typically, a table for each phone holds some target synthesizer control values, together with transition durations and information used to calculate the controls at the nominal boundary between any pair of phones.  Such a system can capture many of the coarticulation effects between phones.
- Separate tables can be included for any allophone variation which is not captured by the coarticulation rules.  The total number of different units will still be far fewer than the number required in a concatenative system.
- Acoustic-phonetic rule systems have tended to be set up 'by hand', but automatic procedures can be used to derive the parameters of these systems, based on optimizing the match of the synthesized speech to phonetically transcribed natural speech data.


### Template Matching speech recognition
- Most early successful speech recognition systems worked by pattern matching on whole words.  Acoustic analysis, for example by a bank of band-pass filters, describes the speech as a sequence of feature vectors, which can be compared with stored templates for all the words in the vocabulary using a suitable distance metric.  Matching is improved if speech level is coded logarithmically and level variations are normalised.
- Two major problems in isolated-word recognition are end-point detection and timescale variation.  The timescale problem can be overcome by dynamic programming (DP) to find the best way to align the time scales of the incoming word and each template (known as dynamic time warping).  Performance is improved by using penalties for timescale distortion.  Score pruning, which abandons alignment paths that are scoring badly, can save a lot of computation.  
- DP can be extended to deal with sequences of connected words, which has the added advantage of solving the end-point detection problem.  DP can also operate continuously, outputting words a second or two after they have been spoken.  A wildcard template can be provided to cope with extraneous noises and words that are not in the vocabulary.
- A syntax is often provided to prevent illegal sequences of words from being recognized.  This method increases accuracy and reduces the computation.
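
The isolated-word case of dynamic time warping described above can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance between feature vectors and a simple symmetric step pattern. The names `dtw_distance`, `recognise` and `slope_penalty` are illustrative, and score pruning is left out for brevity.

```python
import numpy as np

def dtw_distance(word, template, slope_penalty=0.0):
    """Dynamic time warping between two sequences of feature vectors.

    word, template: arrays of shape (frames, features).
    slope_penalty: extra cost for non-diagonal steps (timescale distortion).
    """
    n, m = len(word), len(template)
    # Local distance between every pair of frames (squared Euclidean).
    local = ((word[:, None, :] - template[None, :, :]) ** 2).sum(axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(
                cost[i - 1, j - 1],                 # diagonal: no distortion
                cost[i - 1, j] + slope_penalty,     # template frame repeated
                cost[i, j - 1] + slope_penalty,     # word frame repeated
            )
            cost[i, j] = local[i - 1, j - 1] + best_prev
    return cost[n, m]

def recognise(word, templates):
    """Return the vocabulary entry whose template aligns with the lowest cost."""
    return min(templates, key=lambda name: dtw_distance(word, templates[name]))
```

A slowly spoken version of a word, in which some frames are simply repeated, aligns with its template at zero cost when `slope_penalty` is zero. Raising the penalty discourages implausibly large timescale distortions.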

### Large vocabulary speech recognition
- Some large-vocabulary recognition tasks may require accurate transcription of the words that have been said, while others will need understanding of semantic content (but not necessarily accurate recognition of every word).
- For speech transcription the task is to find the most likely sequence of words, where the probability for any one sequence is given by the product of the acoustic-model and language model probabilities.
- The principles of continuous speech recognition using HMMs can be applied to large vocabularies, but with special techniques to deal with the large number of different words that need to be recognized.
- It is not practical or useful to train a separate model for every word, and instead sub-word models are used.  Typically phone-size units are chosen, with pronunciation of each word being provided by a dictionary.
- Triphone models represent each phone in the context of its left and right neighbours.  The number of possible triphones is so large that many will not occur in any given set of training data.  Probabilities for these triphones can be estimated by 'backing off' to, or interpolating with, biphones (dependent on only the left or the right context) or even context-independent monophones.
- Another option, which allows greater context specificity to be achieved, is to group ('cluster') similar triphones together and share ('tie') their parameters.  A phonetic decision tree can be used to find the best way to cluster the triphones based on questions about phonetic context.  The idea is to optimize the fit to the data while also having sufficient data available to train each tied state.
- An embedded training procedure is used, typically starting by estimating parameters for very general monophone models for which a lot of data are available.  These models are used to initialize triphone models.  The triphones are trained and similar states are then tied together.  Multiple-component mixture distributions are introduced at the final stage.
- The purpose of the language model is to incorporate language constraints, expressed as probabilities for different word sequences.  The perplexity, or average branching factor, provides a measure of how good the language model is at predicting the next word given the words that have been seen so far.
- N-grams model the probability of a word depending on just the immediately preceding N-1 words, where typically N=2 ('bigrams') or N=3 ('trigrams').
- The large number of different possible words is such that data sparsity is a massive problem for language modelling, and special techniques are needed to estimate probabilities for N-grams that do not occur in the training data.
- Probabilities for N-grams that occur in the training text can be estimated from frequency counts, but some probability must be 'freed' and made available for those N-grams that do not occur.  Probabilities for these unseen N-grams can then be estimated by backing off or interpolating with more general models.
- The principles of HMM recognition extend to large vocabularies, with a multiple-level structure in which phones are represented as networks of states, words as networks of phones, and sentences as networks of words.  In practice the decoding task is not straightforward due to the very large size of the search space, especially if cross-word triphones are used.  Special treatment is also required for language models whose probabilities depend on more than the immediately preceding word (i.e. for models more complex than bigrams).  The one-pass Viterbi search can be extended to operate with cross-word triphones and with trigram language models, but the search space becomes very large and is usually organized as a tree.  Efficient pruning is essential.
- An alternative search strategy uses multiple passes.  The first pass identifies a restricted set of possibilities, which are typically organized as an N-best list, a word lattice or a word graph.  Later passes select between these possibilities.  Another option is to use a depth-first search.
- Automatic speech understanding needs further processing of the speech recogniser output to analyse the meaning, which may involve syntactic and semantic analysis.  To reduce the impact of recognition errors, it is usual to start with an N-best list or word lattice.  Partial parsing techniques can be used for syntactic analysis to deal with the fact that the spoken input may be impossible to parse completely because parts do not fit the grammar, due to grammatical errors, hesitations and so on.
- Meaning is often represented using templates, which will be specific to the application and have 'slots' that are filled by means of a linguistic analysis.
- In a spoken dialogue system, a dialogue manager is used to control the interaction with the user and ensure that all necessary information is obtained.
- ARPA has been influential in promoting progress in large vocabulary recognition and understanding, by sponsoring the collection of large databases and running series of competitive evaluations.  Error rates of less than 10% have been achieved for transcribing unlimited-vocabulary read speech and for understanding spoken dialogue queries.  Recognition of more casually spoken spontaneous speech is still problematic.
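
The N-gram ideas summarised above (frequency counts, reserving probability for unseen N-grams, interpolating with a more general model, and perplexity) can be illustrated with a small bigram sketch. The interpolation weight `lam`, the add-one unigram estimate and all function names are assumptions made for this example, not the specific scheme from the source.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, lam=0.7):
    """Interpolate the bigram estimate with the more general unigram estimate."""
    total = sum(unigrams.values())
    # Add-one unigram estimate: keeps some probability free for unknown words.
    p_uni = (unigrams[word] + 1) / (total + len(unigrams) + 1)
    # Relative-frequency bigram estimate; zero if the bigram was never seen.
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

def perplexity(sentence, unigrams, bigrams):
    """Inverse geometric mean of the per-word probabilities."""
    padded = ["<s>"] + sentence + ["</s>"]
    log_p = sum(math.log(bigram_prob(a, b, unigrams, bigrams))
                for a, b in zip(padded, padded[1:]))
    return math.exp(-log_p / (len(padded) - 1))
```

A sentence seen in training receives a lower perplexity than the same words in an unseen order, yet unseen bigrams still receive a non-zero probability through the unigram term.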


## Speech Recognition (Allen, 1994)

Speech is the predominant mode of human communication.  Certainly written language is important, and much of our knowledge is passed from generation to generation in written form, but speech is our preferred mode for everyday interaction.  It is natural to assume that speech will also be the preferred mode for human-machine interaction.  Speech is very efficient and convenient, and allows the hands to be free for performing other tasks.  Until recently, however, speech recognition systems have not been accurate or fast enough to support useful applications.  This is rapidly changing as new recognition techniques and faster machines appear.

### Issues in Speech recognition

Speech recognition systems fall into two classes.  An isolated word recognition system recognizes one word at a time.  To use such a system, you must pause between words.  A continuous speech recognition system recognizes speech as we normally speak it, with words flowing together in a continuous stream.  Most systems currently on the market use isolated word recognition techniques.  Continuous speech recognition systems are under active development, however, and are nearing practical use.  Other major factors that distinguish different systems are vocabulary size and the range of speakers that can be handled.  Systems range from low-end recognizers that can recognize 30 or so words from a single speaker to high-end systems that can recognize 20,000 words from multiple speakers.  When comparing different recognition rates, it is important to remember that it is much more difficult to attain high accuracy rates with large-vocabulary, multiple-speaker continuous speech recognition.

While the same basic techniques for parsing, semantic interpretation, and contextual interpretation can be used for spoken or written language, there are some significant differences that affect system design.  For instance, with spoken input the system has to deal with uncertainty.  In written language the system knows exactly what was said.  In addition, spoken language is structurally quite different from written language.  In fact, sometimes a transcript of perfectly understandable speech is not comprehensible when read.  Spoken language occurs more incrementally, a phrase at a time, and contains considerable intonational information that is not captured in written form.  It also contains many repairs, in which the speaker corrects or rephrases something that was just said.  In addition, spoken dialogue has a rich interaction of acknowledgement and confirmation that maintains the conversation, which doesn't appear in written form.



### Speech Recognition Architecture

The basic architecture is given in figure 1 below.  The sounds produced by a speaker are converted into digital form by an analog-to-digital converter.  This signal is then processed to extract various features, such as the intensity of sound at different frequencies and the change in intensity over time.  These features serve as input to the speech recognition system, which generally uses Hidden Markov Model (HMM) techniques to identify the most likely sequence of words that could have produced the signal.  The speech recognizer then outputs the likely sequence of words to serve as input to the NLU system.  When the NLU needs to generate an utterance, it passes a sentence to a module that translates the words into a phonemic sequence and determines an intonational contour, and then passes this information on to a speech synthesis system, which produces the spoken output.
![Figure 1](https://selene.hud.ac.uk/u1273400/images/seg_media/a1.PNG)

**Figure 1**: Speech understanding Architecture

### Sound Structure of Language

The structure of the sounds in spoken language can be accounted for at two different levels.  The phonemic level classifies sound in terms of its use in language, and the acoustic level classifies sound in terms of its physical properties.

The phonemic level divides speech into primitive components of meaning, much as morphology divides words into components of meaning.  To do this, one must consider all the different sounds used in words in a language and cluster them together so that two sounds that are never used to distinguish one word from another are put into the same group.  Thus we may group different variations of the same phoneme (allophones) together.

Phonemes can be classified into common groups based on certain distinctive features that they share.  For instance, all vowels and some consonants involve the use of the vocal cords to produce a sound, while other consonants, such as /s/ and /sh/, do not involve the vocal cords.  The former are called the voiced phonemes, while the latter are unvoiced.  Some consonants, such as /p/ and /k/, involve stopping the flow of air momentarily and then releasing the air quickly.  These are called stops or plosives.  Other consonants, such as /n/ and /m/, involve using the nasal passage to change the sound and are called nasal phonemes.  There are many different ways to classify the phonemes depending on how they are produced, and the phoneme set can be represented as a set of features that uniquely identify each phoneme.

The other level of analysis important for speech recognition is the acoustic level.  At this level sounds are classified by physical characteristics of the speech signal.  Essentially, a speech recognition system must identify a way of matching various templates of sounds against the new input in order to identify what words have been spoken.  Unfortunately, the speech signal itself is far too complicated to represent directly, and the signals arising from two occurrences of the same word may differ dramatically, even within a single speaker.  The signal is affected greatly by the rate of speech, background noise, and the emotional state of the speaker, all factors not directly related to the words uttered.  The trick is to identify a set of acoustic features that capture those aspects of the signal that appear consistently across many instances of the same sound.

One of the principal techniques for identifying significant features is to use spectral analysis, a frequency analysis of the signal.  This information can be obtained using analog filters on the input, or using digital methods such as the Fast Fourier Transform.  This information is very useful for distinguishing the different sound patterns.  For example, voiced sounds will tend to have considerable intensity in low frequency ranges, while unvoiced consonants will tend to have their intensity more evenly spread across all frequencies.  To see this, consider the figure below, which shows a spectrogram that indicates the signal intensity at different frequency ranges during the word *sad*.  The frequency varies on the vertical axis, from 0 to 4000 Hz (cycles per second).  Time is indicated on the horizontal axis.  In this example, the word occurred between 3.9 and 4.6 seconds from the start of the utterance.  Black indicates silence at that frequency.  The word *sad* consists of 3 phonemes: /s/, /ae/ and /d/.  You can see the /s/ between times 3.9 and 4.1, with the intensity over a broad range of the higher frequencies, especially between 2000 and 4000 Hz.  The vowel starts just after 4.1 with marked intensity in the lower frequency ranges.  Notice three strong peaks at approximately 660, 1700 and 2400 Hz.  These are characteristic of voiced phonemes and are called formants.  This combination of formant frequencies is typical of the vowel /ae/.  The third formant loses intensity as the vowel continues, and the first formant rises and drops slightly.  At 4.4 there is an abrupt stop of the sound, essentially yielding a silence.  This is typical of stop consonants, including /d/.  Shortly before 4.5 the sound resumes.  Notice some weak formant peaks around 4.5.  These arise because /d/ is a voiced consonant.  If the word had been *sat* instead of *sad*, then this last section of the signal would look similar to the /s/ sound at the beginning, as /t/ would be unvoiced.
![Figure 1](https://selene.hud.ac.uk/u1273400/images/seg_media/a3.PNG)

The first three formants are quite reliable indicators of the vowel.  Figure C.4 above shows some data on the average frequency of the first three formants for some English vowels.  The formants reflect the resonances from the shape of the vocal tract and are distinct from the frequency produced by the vocal cords, which is called the pitch or the fundamental frequency.  The fundamental frequency varies considerably in most speech as part of natural intonation to convey questions, make assertions, express surprise, and so on.  But even as it varies, the formants for vowels remain relatively constant, as they reflect the shape of the vocal tract.

Unfortunately, there is not a one-to-one correspondence between phonemes and acoustic features.  For instance, the phoneme /t/ is part of the phonemic spelling of the words *time, string, stripe,* and *fit*.  But listen carefully to each sound as you say the words.  In each case the /t/ contributes quite a different sound to the word as it combines with the surrounding phonemes.  If you look at a spectrogram for each, the distinctions are even more prominent.  Even with the same word, the sounds corresponding to phonemes will differ considerably depending on whether the words are stressed or reduced in the utterance.  While stressed vowels can generally be recognised reliably, when reduced in unstressed words or rapid speech they are often so degraded that different vowels are completely indistinguishable from each other.  These properties make speech recognition considerably more difficult than expected.

### Signal Processing

A microphone converts sound into a varying electrical current that corresponds to the complex sound wave.  The existence of stereo systems demonstrates that this electrical signal contains all the relevant information present in the sound since, using a loudspeaker, it can be converted back into sound that is virtually indistinguishable from the original.  To use such a signal as input to a computer, it must be digitized.  This is performed by an analog-to-digital (A/D) converter, a standard component of virtually all modern personal computers.  There are two important factors in digitizing a signal.  The first is the sampling rate, how often the current is measured, and the second is the quantization factor, the number of different levels of intensity that can be distinguished.  The sampling rate determines how high a frequency in the signal can be accurately recorded.  A theorem in information theory states that the signal must be sampled at a rate of at least twice the highest frequency required in the analysis.  As a point of reference, perceptual studies generally indicate that frequencies up to about 10 kHz (10,000 cycles per second) occur in speech, but speech remains intelligible within a considerably narrower range.  In general, a sampling rate between 8 and 20 kHz is used for speech recognition applications.

The second important factor is the quantization factor, which determines what scale is used to represent the signal intensity.  For instance, a 1-bit quantization would only be able to indicate whether a signal is higher or lower than some mid-range level.  A 2-bit measurement would distinguish four different levels, and so on.  Generally, it appears that 11-bit numbers capture sufficient information, although by using a log scale, you can get by with 8-bit numbers.
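
The effect of the number of bits, and of a log scale, can be tried out with a short sketch. The mu-law curve used here is one common logarithmic companding scheme and is an assumption of this example; the text above only says "a log scale". Samples are assumed to be scaled to the range -1 to 1, and the function names are illustrative.

```python
import numpy as np

def quantize_linear(x, bits):
    """Map samples in [-1, 1] to 2**bits evenly spaced integer levels."""
    levels = 2 ** bits
    return np.clip(np.floor((x + 1) / 2 * levels), 0, levels - 1).astype(int)

def mu_law_encode(x, bits=8, mu=255.0):
    """Compress samples logarithmically, then quantize the compressed value."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return quantize_linear(compressed, bits)

def mu_law_decode(codes, bits=8, mu=255.0):
    """Invert mu_law_encode, using the centre of each quantization level."""
    levels = 2 ** bits
    compressed = (codes + 0.5) / levels * 2 - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu
```

With one bit, `quantize_linear` only reports whether a sample is below or above the mid-level. The logarithmic version spends more of its 256 levels on quiet samples, which is why fewer bits can suffice.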

So a typical representation of a speech signal is a stream of 8-bit numbers arriving at the rate of 10,000 numbers per second, clearly a large amount of data.  The challenge for speech recognition is to reduce this data to a manageable representation.  In fact, most current speech recognition systems end up classifying each segment of signal into one of 256 distinct categories.  Clearly, we must be very careful in choosing these categories to ensure that they capture the important distinctions in the signal and ignore the variations that do not matter.  The rest of this section describes how this task can be accomplished.

The first technique is to represent the signal as a sequence of segments.  There is a tradeoff in how much signal to capture in a single segment.  The larger the segment, the more data is available to make a classification, but the less sensitive the classification will be for representing the rapid transitions that are necessary to reliably recognise stops and other transient consonants.  A typical segment size would be 20 ms (containing 200 samples when using a 10 kHz sampling rate).  Rather than divide the signal into discrete segments, most systems use overlapping segments to make sure that segment boundaries don't accidentally mask important rapid transitions.  A typical increment may be 10 ms.
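
The segmentation just described (20 ms segments advanced in 10 ms steps at a 10 kHz sampling rate) can be written as a short helper. The function name `frame_signal` and its parameter names are illustrative, and the sketch assumes the signal is at least one segment long.

```python
import numpy as np

def frame_signal(samples, sample_rate=10_000, frame_ms=20, step_ms=10):
    """Split a 1-D sample array into overlapping fixed-length segments."""
    frame_len = sample_rate * frame_ms // 1000   # 200 samples per segment
    step = sample_rate * step_ms // 1000         # a new segment every 100 samples
    n_frames = 1 + (len(samples) - frame_len) // step
    return np.stack([samples[i * step:i * step + frame_len]
                     for i in range(n_frames)])
```

One second of speech at these settings yields 99 segments of 200 samples each, with every segment sharing half its samples with the next.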

Now the task is to characterise the signal within each segment in a way that captures the information that most reliably identifies particular speech sounds over a wide range of conditions.  Two simple measurements are:

#### Overall Intensity
The intensity can be measured by the sum of the squares of the numbers in the segment, and it is a good indicator of whether the segment is part of a voiced phoneme, an unvoiced phoneme or silence.

#### Peak measurements
The average time between significant intensity peaks in the signal will tend to reflect the fundamental frequency in voiced speech.
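
Both simple measurements can be sketched directly on one segment of samples. The peak picker below is deliberately crude: the 50% threshold and the function names are assumptions for illustration, and real speech would need a more robust method.

```python
import numpy as np

def intensity(segment):
    """Overall intensity: the sum of the squared sample values."""
    return float(np.sum(np.asarray(segment, dtype=float) ** 2))

def pitch_from_peaks(segment, sample_rate=10_000, threshold=0.5):
    """Estimate the fundamental frequency from the spacing of large peaks."""
    x = np.asarray(segment, dtype=float)
    limit = threshold * x.max()
    # A peak is a sample above the limit that is larger than its left
    # neighbour and at least as large as its right neighbour.
    is_peak = (x[1:-1] > limit) & (x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:])
    peaks = np.flatnonzero(is_peak) + 1
    if len(peaks) < 2:
        return None          # too few peaks: likely unvoiced or silent
    return sample_rate / np.mean(np.diff(peaks))
```

For a pure 100 Hz tone sampled at 10 kHz the peaks fall 100 samples apart, so the estimate is 100 Hz. A silent segment returns `None`.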

Other measurements on a segment can be obtained from performing a spectral analysis, using a technique known as the Fast Fourier Transform to produce an analysis of the intensity of the signal at different frequencies.  The spectral analysis can be used for many purposes.  You might identify the peaks in the spectrum that tend to reflect the formants in voiced speech, and different spectral patterns will reflect different sorts of consonants.  Often the spectral analysis is used as an intermediate stage for additional processing to extract the key aspects of the spectrum.  For instance, you might try to fit a polynomial curve to the frequency spectrum, and use the roots of the polynomial as an abstract representation of the spectrum.  A large number of techniques are used in the literature to reduce this information to a few key features.
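
As a rough sketch of such a spectral analysis, the following computes the intensity of one segment in a handful of frequency bands using NumPy's real FFT. The band edges and the Hamming taper are arbitrary choices made for this example, not values from the text.

```python
import numpy as np

def band_intensities(segment, sample_rate=10_000,
                     edges=(0, 500, 1000, 2000, 3000, 5000)):
    """Sum the spectral power of one segment within each frequency band."""
    windowed = segment * np.hamming(len(segment))   # taper to reduce edge effects
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sample_rate)
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])
```

A 200-sample segment containing a 700 Hz tone puts almost all of its power in the 500 to 1000 Hz band, so that band has the largest value.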

The other major set of techniques for classifying a segment characterise how it relates to the preceding segment.  In particular, because of the overlap in segments and their relatively short duration, many adjacent segments are very similar to each other.  For instance, if a vowel sound lasts 100 ms, there would be 10 segments during that period.  The changes between the segments would be gradual.  With consonants such as stops, however, there is a dramatic change over a short period.  One segment might be characteristic of silence and the next of a voiced consonant.

A technique used in many systems is **Linear Predictive Coding**.  This operates directly on the actual sample values, which we denote by x(1), x(2), and so on.  At a 10 kHz sampling rate, there are 10,000 of these values a second.  The technique involves finding a set of parameters $\alpha_i$, such that a current sample value, x(k), is predicted by the following formula in terms of the previous n samples:

$$x(k)=\sum_{i=1}^n \alpha_i x(k-i)$$

The idea is that patterns in the speech signal will be reflected in the coefficients found.  To use the technique, a uniform set of coefficients must be used to estimate all the sample values within a single segment.  Of course, this means that there will be errors in the prediction of some samples.  The coefficients that minimize the error over the samples in a segment are used to characterize the segment.  The error term will tend to be small when the signal is periodic, as with voiced segments, and large when the signal is not periodic, as with unvoiced segments.  For voiced segments, the coefficients can be used to produce quite reliable estimates of the formants in the signal.
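
One way to find the coefficients that minimize the squared prediction error over a segment is an ordinary least-squares fit, sketched below with `numpy.linalg.lstsq`. Practical systems usually use faster dedicated methods; the function name and the choice of solver are assumptions of this example.

```python
import numpy as np

def lpc_coefficients(segment, order):
    """Least-squares fit of x(k) ~ sum_i alpha_i * x(k - i) over one segment.

    Returns (alpha, error): alpha[i - 1] multiplies x(k - i), and error is
    the summed squared prediction error over the segment.
    """
    x = np.asarray(segment, dtype=float)
    # Row for sample k holds the previous samples x(k-1), ..., x(k-order).
    past = np.column_stack([x[order - i:len(x) - i] for i in range(1, order + 1)])
    target = x[order:]
    alpha, *_ = np.linalg.lstsq(past, target, rcond=None)
    error = float(np.sum((target - past @ alpha) ** 2))
    return alpha, error
```

A pure sinusoid is predicted almost perfectly by two coefficients, giving an error near zero, whereas random noise leaves a large error. This mirrors the voiced versus unvoiced contrast described above.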

In summary, segments can be classified by features based on information such as:

- the overall intensity of the signal
- the overall intensity in various frequency ranges
- the formants
- the rate of change of the above measurements between segments

The simplest signal processing system might just measure the intensity in 5 to 10 different frequency ranges, while the most complex might use sophisticated algorithms to represent the shape of the frequency spectrum and how it changes over time.  In all cases the end result is a small set of values that forms the input to the speech recognition system.

The actual recognition is driven by methods of comparing templates of segments constructed from training data to the segments constructed from the new input.  To do this matching, some measure of similarity must be defined based on the features extracted for each segment.  To explore this, consider a very simple system that attempts to classify each segment as a voiced phoneme, an unvoiced phoneme, or silence.  Say that each segment is represented by a vector of three numbers (each from 0 to 256) that represent the overall intensity of the signal and the intensity in two frequency ranges, 100-600 Hz and 600-1500 Hz.  For example, a segment represented by the vector (210 140 60) would have an overall intensity of 210, with an intensity of 140 between 100 and 600 Hz and 60 between 600 and 1500 Hz.  After training on a sample of labeled data, assume the following templates are produced based on average values for each type of segment.

- voiced segments: (180 120 50)
- unvoiced segments: (50 10 10)
- silences: (30 15 10)

We will measure the similarity between an input segment and a template by the sum of squared differences.  The smaller the number, the more similar the patterns.  The input vector (210 140 60) would produce the following measures:

- Difference from voiced: $(210-180)^2+(140-120)^2+(60-50)^2=1,400$
- Difference from unvoiced: $(210-50)^2+(140-10)^2+(60-10)^2=45,000$
- Difference from silence: $(210-30)^2+(140-15)^2+(60-10)^2=50,525$

Thus this segment would be classified as a voiced phoneme.  An input segment represented as (40 10 12), on the other hand, would be classified as an unvoiced phoneme (with a score of 104 versus 129 for silence and 33,144 for voiced).
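
The worked example can be reproduced with a few lines of code. The template values are those given above; the function and variable names are illustrative.

```python
# Templates from the text: (overall, 100-600 Hz, 600-1500 Hz) intensities.
TEMPLATES = {
    "voiced": (180, 120, 50),
    "unvoiced": (50, 10, 10),
    "silence": (30, 15, 10),
}

def squared_difference(segment, template):
    """Sum of squared differences between a segment and a template."""
    return sum((s - t) ** 2 for s, t in zip(segment, template))

def classify(segment, templates=TEMPLATES):
    """Return the best-matching label and the score for every template."""
    scores = {label: squared_difference(segment, t)
              for label, t in templates.items()}
    return min(scores, key=scores.get), scores

print(classify((210, 140, 60)))   # voiced: 1400, unvoiced: 45000, silence: 50525
print(classify((40, 10, 12)))     # unvoiced: 104, silence: 129, voiced: 33144
```

Note how close the unvoiced and silence scores are for the second input (104 versus 129), which hints at why a richer feature vector is needed in practice.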

Of course, a speech recognition system will need a much more sophisticated representation of segments than this.  A typical vector size consists of 24 numbers, capturing various measures such as intensity, formants, and changes from the previous segment, as discussed earlier.  In addition, the similarity measures can be considerably more complicated, reflecting the fact that some measures are more important than others, or reflecting interdependencies between the measures.  A typical system uses 256 templates for classifying segments.  These templates define a set of symbols called a codebook.  There are automatic techniques for designing a codebook that optimizes recognition performance on a given set of training data.
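
The text does not say which automatic technique is used to design a codebook. As one plausible illustration, the sketch below uses k-means clustering, which places each template at the mean of the training vectors assigned to it. The function names and the simple evenly spaced initialisation are assumptions of this example.

```python
import numpy as np

def quantize(vectors, codebook):
    """Index of the nearest codebook entry for every feature vector."""
    distances = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def train_codebook(vectors, size, iterations=20):
    """k-means sketch: alternate nearest-template assignment and averaging."""
    vectors = np.asarray(vectors, dtype=float)
    # Simple, arbitrary initialisation: evenly spaced picks from the data.
    picks = np.linspace(0, len(vectors) - 1, size).astype(int)
    codebook = vectors[picks].copy()
    for _ in range(iterations):
        labels = quantize(vectors, codebook)
        for c in range(size):
            members = vectors[labels == c]
            if len(members):              # leave unused templates unchanged
                codebook[c] = members.mean(axis=0)
    return codebook
```

After training, `quantize` turns each feature vector into a codebook symbol, so a call with `size=256` would produce symbols corresponding to C1 through C256.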

### Statistical Recognition

Once the signal processing has reduced the signal to a sequence of symbols from the codebook, the speech recognition task looks more like a traditional parsing problem.  Specifically, the recognizer is given a sequence of symbols and must identify the most likely sequence of words that could have generated that input.  This is similar to determining the best parts of speech for words in a sentence and suggests that HMM models can be used effectively.  This is in fact the case, and the most successful speech recognition systems today are based on HMM techniques.

An important issue to consider is at what level the HMM should be based.  One obvious possibility is the word.  For each word in the lexicon, we could define an HMM network that defines the likely sequences of codebook symbols that could realize the word.  For limited-vocabulary applications, this is a viable technique.  It is possible to obtain enough training examples for each word to define the networks.  The result is a robust recognition system.  But if you are considering large-vocabulary applications, it is difficult to find enough training data.  For instance, some speech recognition systems now have vocabulary sizes of 20,000 words or more.  Finding enough training data for such a vocabulary size is near to impossible.

Another problem arises in continuous speech applications.  A word may be realized differently depending on what words surround it.  These co-articulation effects can have a significant effect on the realization of the word.  This is especially true of function words such as *the*, which are highly influenced by the words that follow them.  To account for this, you would need to train the models not only on individual words but also on pairs of words, making the training problem even more difficult.

Because of these difficulties, large vocabulary speech recognition systems are typically based on subword units.  The phoneme would seem to be a natural unit.  In a phoneme-based system, there would be an HMM network that defines the likely codebook sequences that realize each phoneme.  Even with a liberal set of phonemes that included common allophones, there would be fewer than 100 units that require training, making training feasible with even 5 minutes of speech data.  The problem, as mentioned earlier, is that the realization of phonemes is highly dependent on the surrounding context.  Remember that the phoneme /t/ will look very different depending on the surrounding context.  One successful solution to this problem is to use triphones, or phonemes in context (PIC).  In this model there would be no HMM for the phoneme /d/, but there would be an HMM defined for the phone /d/ preceded by a silence and followed by the vowel /ee/, written as sil/d/ee, and others for the combinations sil/d/I, I/d/sil, and so on.  Of course, with a set of 40 phonemes the full set of triphones would consist of $40^3$ (64,000) PICs, so training would again become unmanageable.  Luckily, many combinations never occur in a given language, so the actual number is not nearly so big.  In addition, it is possible to collapse the preceding and following phonemes into more general classes that all affect the central phoneme in the same way.  By using these techniques, the number of phonemes in context that need to be modeled can be kept at a manageable size (on the order of 1000).  In addition, with this model there is a possibility of using smoothing techniques.  For instance, if a particular phoneme-in-context triple has never been observed, you might estimate its characteristics based on phoneme pair models, or a general model of the context-independent phoneme.
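
The PIC notation and the fallback to more general models can be illustrated as follows. The `left/centre/right` strings follow the notation in the text. The `*` wildcard keys for phoneme pair models and the `models` dictionary are hypothetical conventions invented for this sketch.

```python
def to_triphones(phones):
    """Expand a phoneme sequence into PICs, with silence at both ends."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{left}/{centre}/{right}"
            for left, centre, right in zip(padded, padded[1:], padded[2:])]

def lookup_model(triphone, models):
    """Fall back from the full triphone to pair models, then the bare phoneme."""
    left, centre, right = triphone.split("/")
    candidates = (
        triphone,                 # exact phoneme in context
        f"{left}/{centre}/*",     # left context only (hypothetical key format)
        f"*/{centre}/{right}",    # right context only (hypothetical key format)
        centre,                   # context-independent phoneme
    )
    for key in candidates:
        if key in models:
            return models[key]
    raise KeyError(centre)

print(to_triphones(["d", "ee"]))   # ['sil/d/ee', 'd/ee/sil']
```

Each lookup returns the most specific model that has been trained, so an unseen triple still maps to something usable.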

Each PIC is represented as an HMM that encodes likely sequences of codebook symbols that realize the signal.  A simplified HMM model for a phoneme is shown in the figure below.  The state labeled S is the start state.  The states capture the normal progression through symbols typical of the beginning part of the PIC (state B), then the middle part (state M), and finally the ending part (state E), and the phone is completed when it reaches state F.  Each node is associated with a different probability distribution for the codebook symbols (C1,..., C256).

![Figure 3](https://selene.hud.ac.uk/u1273400/images/seg_media/a5.PNG)

**Figure 3** Simplified HMM template for phonemes in context
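
As a rough sketch, the template can be written down as a table of transition probabilities. The numbers below are invented for illustration, and the code assumes that S and F emit no symbols and that each of B, M, and E has a single self-loop and a single forward arc; the figure may contain arcs that this sketch omits.

```python
import random

# Transition probabilities for the S -> B -> M -> E -> F template.
# All values are made up for illustration.
TRANSITIONS = {
    "S": {"B": 1.0},
    "B": {"B": 0.6, "M": 0.4},
    "M": {"M": 0.7, "E": 0.3},
    "E": {"E": 0.5, "F": 0.5},
}

def check_template(transitions):
    """Verify that the outgoing probabilities of every state sum to 1."""
    return all(abs(sum(out.values()) - 1.0) < 1e-9 for out in transitions.values())

def sample_state_path(transitions, rng):
    """Walk from S to F and return the emitting states visited on the way."""
    state, path = "S", []
    while True:
        targets = list(transitions[state])
        weights = [transitions[state][t] for t in targets]
        state = rng.choices(targets, weights=weights)[0]
        if state == "F":
            return path
        path.append(state)

print(check_template(TRANSITIONS))
print(sample_state_path(TRANSITIONS, random.Random(0)))
```

Every sampled path visits B, then M, then E, but the number of symbols spent in each state varies from one sample to the next.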

Consider an example of how such an HMM model is trained.  For each PIC, a set of training instances is selected from the data.  Each of these is represented by a sequence of codebook symbols produced by the signal processing component.  For instance, for the PIC sil/p/l, we might find examples such as the following in the training data:

    C1 C1 C2 C1 C3 C3 C9 C3 C4 C7 C8
    C1 C2 C2 C4 C4 C3 C5 C9
    C2 C2 C3 C5 C5 C6 C3 C8 C7 C7
    C3 C4 C3 C3 C5 C6 C6 C6 C8 C9
    C1 C2 C1 C4 C4 C4 C5 C8 C8
    
Using standard HMM training techniques, the system might then produce the model shown in the figure below.

![Figure 4](https://selene.hud.ac.uk/u1273400/images/seg_media/a6.PNG)

**Figure 4** HMM model for sil/p/l after training

The transition probabilities and the output probabilities have been chosen to locally optimize the probability that the network would generate the five observed sequences.  Note that the HMM model provides a solution to a difficult problem that arises because words are often said at quite different rates.  Thus a phoneme might sometimes be realised by 10 codebook symbols (100 ms) and other times by only 3 or 4 symbols (30 to 40 ms).  The arcs that return to the same state that they left allow the model to assign a probability to staying in the same state for multiple input symbols.  Many systems, however, find this model still too crude and augment the recognition system with additional mechanisms that explicitly model the expected durations of various syllables.
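
To see why a self-loop is a crude duration model, note that a state with self-loop probability $p$ emits exactly $d$ symbols with probability $p^{d-1}(1-p)$, so the single most likely duration is always one symbol. The short sketch below computes this, taking 10 ms per codebook symbol from the figures quoted above; the function names are illustrative.

```python
def duration_probability(self_loop, symbols):
    """Probability of staying in one state for exactly `symbols` codebook symbols."""
    return self_loop ** (symbols - 1) * (1.0 - self_loop)

def expected_duration_ms(self_loop, symbol_ms=10.0):
    """Mean time spent in the state, at `symbol_ms` per codebook symbol."""
    return symbol_ms / (1.0 - self_loop)

for p in (0.5, 0.8):
    print(p, expected_duration_ms(p),
          duration_probability(p, 1), duration_probability(p, 4))
```

Raising the self-loop probability lengthens the mean duration, but the distribution still decays steadily from one symbol onward, which is one motivation for the explicit duration mechanisms mentioned above.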

In general, HMMs for phonemes must be more complicated in order to handle variations such as those found in very rapid speech.  For example, the basic HMM model used in the SPHINX system is shown in the figure below.

![Figure 5](https://selene.hud.ac.uk/u1273400/images/seg_media/a7.PNG)

**Figure 5** HMM model used in the SPHINX system

Given a sequence of observations (expressed as symbols from the codebook), and a set of trained HMM models for each PIC, the Viterbi algorithm can be used to calculate the best path through each HMM representing a phoneme in context.  From this, we can calculate the probability of the sequence given each HMM, and hence identify the PIC with the highest probability of generating the sequence.
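
A minimal sketch of this scoring step follows. It assumes discrete observations, three emitting states per PIC with a self-loop and a forward arc, and invented emission probabilities; the names `viterbi_score`, `left_to_right`, `MODELS`, and `recognise` are hypothetical. Log probabilities are used to avoid numerical underflow on long sequences.

```python
import math

def log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi_score(obs, model, final="F"):
    """Log probability of the best state path that emits `obs` and exits to `final`."""
    start, trans, emit = model["start"], model["trans"], model["emit"]
    best = {s: log(start.get(s, 0.0)) + log(emit[s].get(obs[0], 0.0)) for s in emit}
    for symbol in obs[1:]:
        best = {
            s: max(best[r] + log(trans[r].get(s, 0.0)) for r in emit)
            + log(emit[s].get(symbol, 0.0))
            for s in emit
        }
    return max(best[s] + log(trans[s].get(final, 0.0)) for s in emit)

def left_to_right(emit, loop=0.5):
    """Build a chain model: each state loops on itself or moves to the next."""
    states = list(emit)
    trans = {}
    for i, s in enumerate(states):
        nxt = states[i + 1] if i + 1 < len(states) else "F"
        trans[s] = {s: loop, nxt: 1.0 - loop}
    return {"start": {states[0]: 1.0}, "trans": trans, "emit": emit}

# Invented emission tables for two PICs.
MODELS = {
    "sil/p/l": left_to_right({
        "B": {"C1": 0.5, "C2": 0.5},
        "M": {"C3": 0.4, "C4": 0.3, "C5": 0.3},
        "E": {"C8": 0.6, "C9": 0.4},
    }),
    "sil/d/ee": left_to_right({
        "B": {"C6": 0.5, "C7": 0.5},
        "M": {"C1": 0.5, "C2": 0.5},
        "E": {"C3": 0.5, "C4": 0.5},
    }),
}

def recognise(obs, models):
    """Return the name of the model whose best path scores highest."""
    return max(models, key=lambda name: viterbi_score(obs, models[name]))

print(recognise(["C1", "C2", "C4", "C5", "C8"], MODELS))
```

In this toy setup the second model cannot emit C1 from its first state, so its score is negative infinity and sil/p/l is selected.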

But this simple scheme would only work if the input was segmented into phonetic units that could be analysed one at a time.  This is not possible in practice, so the recognition system must solve two problems at once: phoneme segmentation and phoneme recognition.  To constrain this task, you can use word definitions to predict what phoneme sequences are likely.  For example, the word *please* could be spelled phonetically as /p/ /l/ /ee/ /z/.  To recognise this word, the system would need to recognize the PICs sil/p/l, p/l/ee, l/ee/z, and ee/z/sil.

Each of these is defined by an HMM, so we have a sequence of HMMs to account for the set of observations.  We can construct an HMM for the word *please* by concatenating the PIC HMMs together, creating a new network as shown in the figure below.  The Viterbi algorithm using this network will then segment the observations into the best PIC boundaries as it finds the most likely state sequence.

![Figure 6](https://selene.hud.ac.uk/u1273400/images/seg_media/a8.PNG)

**Figure 6** PIC HMMs concatenated to form a single HMM for the word *please*
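
The segmentation that falls out of the Viterbi search can be illustrated with a small sketch. To keep it short, each PIC is reduced to a single state with a self-loop (a real model would use several states per PIC), and the emission probabilities are invented; `concatenate` and `segment` are hypothetical names.

```python
import math

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def concatenate(pic_models):
    """Chain PIC HMMs into one left-to-right word HMM.

    Each PIC model is a list of (self_loop_prob, emission_dict) states; the
    exit arc of one state leads to the next state in the chain.
    """
    return [(pic, loop, emit) for pic, states in pic_models for loop, emit in states]

def segment(obs, chain):
    """Viterbi over the chain; returns the PIC label assigned to each observation."""
    n = len(chain)
    score = [float("-inf")] * n
    score[0] = log(chain[0][2].get(obs[0], 0.0))
    back = []
    for symbol in obs[1:]:
        new, ptr = [], []
        for j, (_, loop, emit) in enumerate(chain):
            stay = score[j] + log(loop)
            move = score[j - 1] + log(1.0 - chain[j - 1][1]) if j > 0 else float("-inf")
            ptr.append(j if stay >= move else j - 1)
            new.append(max(stay, move) + log(emit.get(symbol, 0.0)))
        score = new
        back.append(ptr)
    if score[-1] == float("-inf"):
        raise ValueError("observations cannot be generated by this word model")
    j, path = n - 1, [n - 1]  # the path must end in the last state
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return [chain[j][0] for j in reversed(path)]

# Invented single-state models for the four PICs of "please".
PLEASE = concatenate([
    ("sil/p/l", [(0.5, {"C1": 0.6, "C2": 0.4})]),
    ("p/l/ee", [(0.5, {"C3": 0.7, "C4": 0.3})]),
    ("l/ee/z", [(0.5, {"C5": 0.5, "C6": 0.5})]),
    ("ee/z/sil", [(0.5, {"C8": 0.5, "C9": 0.5})]),
])

labels = segment(["C1", "C1", "C2", "C3", "C4", "C5", "C6", "C6", "C9"], PLEASE)
print(labels)
```

The returned labels mark where one PIC ends and the next begins, so recognition and segmentation are solved by the same search.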

Of course, there may be multiple pronunciations of the word *please*, so alternative phonetic spellings must be allowed as well.  If you wish to account for the different frequencies of alternate pronunciations, then an HMM network can be defined for each word that specifies the likelihood of various phoneme sequences in the realisation of the word.

This scheme will work for isolated words but will be inadequate for continuous speech.  In particular, two adjacent words might not have a silence between them.  This can be handled by constructing a network that allows two different PIC sequences, one that includes a silence and one that doesn't.  The two possible phoneme sequences to model the words *please send* might be

- sil/p/l  p/l/ee   l/ee/z ee/z/sil  sil/s/$\epsilon$ s/$\epsilon$/n  $\epsilon$/n/d n/d/sil
- sil/p/l  p/l/ee   l/ee/z ee/z/s  z/s/$\epsilon$ s/$\epsilon$/n  $\epsilon$/n/d n/d/sil
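
The two sequences above can be generated mechanically from the word pronunciations. The sketch below is one way to do so; the function name `pics_for_words` is hypothetical, and the ASCII label `eh` stands in for the vowel written $\epsilon$ above.

```python
def pics_for_words(pronunciations, pause_between=True):
    """Expand word pronunciations into PICs, with or without inter-word silence.

    pronunciations: one list of phoneme labels per word.
    """
    phones = ["sil"]
    for i, word in enumerate(pronunciations):
        if i > 0 and pause_between:
            phones.append("sil")
        phones.extend(word)
    phones.append("sil")
    return [
        f"{phones[i - 1]}/{phones[i]}/{phones[i + 1]}"
        for i in range(1, len(phones) - 1)
        if phones[i] != "sil"  # silence is context only, not a PIC of its own here
    ]

please_send = [["p", "l", "ee", "z"], ["s", "eh", "n", "d"]]
print(pics_for_words(please_send, pause_between=True))
print(pics_for_words(please_send, pause_between=False))
```

Only the PICs on either side of the word boundary differ between the two outputs, which is why a network with two alternative paths at the boundary is enough.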

To model the likely word sequences, speech recognition systems typically use a bigram model of syntax.  This model defines another level of HMMs, which can be used to model the category transitions, yielding probabilities of word sequences.  The use of a bigram model can greatly increase the accuracy of a speech recognition system, as it eliminates many possibilities that would have caused an ambiguity.
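
As a rough illustration of how a bigram model rules out unlikely word sequences, the sketch below estimates word bigram probabilities from counts over a three-sentence toy corpus. The corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions for this example; a practical system would also smooth the estimates so that unseen pairs do not receive probability zero.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return contexts, bigrams

def sequence_probability(words, contexts, bigrams):
    """Unsmoothed maximum-likelihood probability of a word sequence."""
    padded = ["<s>"] + list(words) + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        if contexts[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob

CORPUS = [
    ["please", "send", "it"],
    ["please", "send", "help"],
    ["police", "send", "help"],
]
contexts, bigrams = train_bigrams(CORPUS)
print(sequence_probability(["please", "send", "help"], contexts, bigrams))
print(sequence_probability(["please", "sand", "help"], contexts, bigrams))
```

An acoustically similar but unattested sequence such as *please sand help* scores zero here, so the recogniser would prefer *please send help* even if the acoustic evidence were close.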


### Speech Recognition and NLU

As you have just seen, augmenting a speech recognition system with a bigram word model significantly improves the performance of the system.  This suggests that a more comprehensive model would produce even better performance.  In practice, however, this is difficult to accomplish.  Word trigrams could be used, but this would require significantly more data.  Directly integrating a probabilistic context-free grammar also poses difficulties.  First, speech systems currently achieve an elegant integration of the bigram models, word models, and phoneme models because all can be represented within the same framework, namely HMMs.  Introducing a context-free grammar formalism for the syntactic component would cause a lack of integration that could adversely affect the recognition accuracy or the efficiency of recognition.  As a result, all current spoken language understanding systems maintain a strict separation between the speech recognizer and the NLU system.

Accepting this division, there are still many options for designing the interface.  The simplest interface, and the one in common use, is one in which the speech recogniser outputs the single best sequence of words it can find.  The language system then processes this and hopes that there are no serious recognition errors.  A generalisation of this scheme is called N-best, where the speech recogniser outputs the N best sequences it can find.  In practice, the N best sequences are often substantially the same, differing in only a word or two.  Thus, if the speech recogniser has made an error in recognition on a particular word, it is likely that the same error occurs in all the N-best alternatives.
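
One way to picture the N-best interface is as a ranked list that the language system scans until it finds a hypothesis it can interpret. The sketch below assumes hypotheses arrive as (words, score) pairs with higher scores being better; the scores and the acceptance test are invented stand-ins for a real recogniser and parser.

```python
def choose_hypothesis(n_best, can_interpret):
    """Return the best-scoring hypothesis that the language system accepts."""
    for words, _score in sorted(n_best, key=lambda h: h[1], reverse=True):
        if can_interpret(words):
            return words
    return None

# Invented recogniser output: three hypotheses that differ in only one word.
N_BEST = [
    (["please", "sand", "help"], -12.1),
    (["please", "send", "help"], -12.4),
    (["police", "send", "help"], -15.0),
]

def toy_parser(words):
    """Stand-in for the language system: accept a known verb in second position."""
    return len(words) > 1 and words[1] in {"send", "take"}

print(choose_hypothesis(N_BEST, toy_parser))
```

The sketch also shows the weakness noted above: if every hypothesis contained the same misrecognised word, no amount of reranking would recover the intended sentence.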

An interesting alternative to N-best is a word-lattice output.  In this approach the recognition system would output a lattice of the most likely words in the input.  A word lattice provides a compact representation of a large number of possible sentences, creating a substantially richer environment for error recovery based on parsing and semantic interpretation.  Note that you can view a word lattice as the initial chart for a chart parser.  The fact that there are multiple alternatives for what words appear at what positions does not affect the basic parsing algorithm.
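
A word lattice can be sketched as a set of edges between positions in the input, much like the entries of a chart. The example below is illustrative only: the edge format (from-node, to-node, word, log score), the scores, and the function name are assumptions, and the lattice is assumed to be acyclic with edges that always move forward.

```python
def lattice_sentences(edges, start, end):
    """Yield (words, total_score) for every path from `start` to `end`."""
    outgoing = {}
    for src, dst, word, score in edges:
        outgoing.setdefault(src, []).append((dst, word, score))

    def walk(node, words, total):
        if node == end:
            yield words, total
            return
        for dst, word, score in outgoing.get(node, []):
            yield from walk(dst, words + [word], total + score)

    yield from walk(start, [], 0.0)

# Five invented edges encode four candidate sentences.
LATTICE = [
    (0, 1, "please", -1.0),
    (0, 1, "police", -2.5),
    (1, 2, "send", -0.5),
    (1, 2, "sand", -1.5),
    (2, 3, "help", -0.2),
]

sentences = list(lattice_sentences(LATTICE, 0, 3))
best_words, best_score = max(sentences, key=lambda s: s[1])
print(len(sentences), best_words)
```

Because alternatives at different positions combine freely, the number of sentences represented grows much faster than the number of edges, which is what makes the lattice compact.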

Such general techniques have not been pursued extensively, as the current generation of systems uses domain-specific techniques to optimize short-term performance.  Domain-specific techniques can be used to correctly interpret a query even when large portions of the input are misrecognised.  Researchers have found it more effective in the short term to improve the domain-specific interpretation heuristics than to explore more general and robust interfaces between the speech recognizer and the natural language system.  This could change as the applications become more complex.

### Prosody and Intonation

There is one further aspect of speech that is extremely important for recognition and language understanding.  This is prosody: a collection of phenomena relating to how sentences are spoken rather than what is spoken.  It consists of perceptual cues such as:

1. **the intonation contour** - the fundamental frequency (or pitch) rises and falls throughout the utterance.
2. **the speech rate** - a speaker continually varies the speech rate, elongating some syllables, shortening others, adding pauses between words, and so on.
3. **the intensity** - a speaker also varies the intensity or loudness during an utterance, emphasising certain syllables whilst deemphasising others.

Prosody is critically important for spoken language understanding because it conveys how the speaker relates to what is being said.  In some cases it plays the role that punctuation plays in written language, or vice versa.  Intonation, however, is much richer than punctuation.  For example, the phrase 'You are coming to the party' can be an invitation, a question, or an indication of surprise.  In written language, all three would be indicated using a question mark.

Another important aspect of spoken language is stress.  All sentences have a typical stress pattern that is normally used, but the speaker can vary the stress in order to indicate what is being focused on or to indicate contrasts and corrections.  Stress in speech is signaled by increased intensity and by factors such as lengthening of a word and the use of intonational patterns, such as a temporary rise in pitch at the stressed word.  In written language, stress is sometimes indicated by putting the words in all uppercase letters, as in

    I only took TWO candies.  (that is, not three)
    I only took two CANDIES.  (that is, not cookies)

Consider possible answers to the question *Do you own a red car?*

    No, but I own a red BIKE.
    No, but I own a BLUE car.
    
It would be very strange to answer the question with the sentences

    No, but I own a RED bike.
    No, but I own a blue CAR.
    
The reason is that the stress emphasizes new and contrastive information.  It makes no sense to stress words that are already expected given the context.  

The intonational contour and speech rate are also important in phrasing, indicating where one phrase or sentence ends and another begins.  Here is a sequence of words uttered by a single person in a dialogue.  Without prosodic information, you cannot tell where the phrase and sentence boundaries are:

    Not at the same time OK we're gonna hook up engine E2 to the boxcar at Elmira and send that off to Corning now while we're loading that boxcar with oranges at Corning we're gonna take engine E3 and send it over to Corning hook it up to the tanker car and send it back to Elmira.

Without prosodic information, this sequence of words is highly ambiguous as to where utterance boundaries are located.  Consider the word *now*.  It might be a temporal adverbial, part of the utterance:

    Send that off to Corning now while we're loading the boxcar
    
Alternatively, it might be a discourse cue word and its own separate utterance, yielding this sequence of utterances.

    Send that off to Corning
    Now
    While we're loading the boxcar, ...
    
The different utterance segmentations yield radically different interpretations, yet you cannot tell from the words which one is intended.  If you could listen to this in its original spoken form, however, it would be obvious that its structure is the latter interpretation.

While prosody clearly plays a crucial role in spoken dialogue, computational methods have not been extensively studied.  One reason for this is that current speech recognition systems are designed to handle a single utterance at a time (that is, the sentence boundaries are determined externally) and have very limited dialogue capabilities.  As a result, prosodic analysis has not been essential for current applications.  Some work has indicated that prosodic analysis can improve the accuracy of speech recognition, but no system has yet integrated prosodic processing in any significant way.

One place where prosody is of crucial importance is in speech synthesis.  A system that synthesizes speech with no prosodic phrasing sounds extremely artificial and is difficult to understand.  So the best speech synthesis systems all use heuristic techniques for determining a prosodic contour, typically based on statistical methods for determining phrase boundaries in written text.  It is important that these methods work well, as a synthesizer using inappropriate prosody would be much worse than one that doesn't use prosody at all.

## References

Allen, J. (1994). Natural language understanding (2nd ed.). Redwood City, CA: Benjamin/Cummings.

Holmes, J. N., & Holmes, W. (2001). Speech synthesis and recognition (2nd ed.). New York: Taylor & Francis.