Implement the Hamming and Levenshtein (edit) distances and compute them for the following word pairs. It may help to compute them by hand first.
Aside from computing the distance (which is a scalar), do the backtrace and compute the edit transcript (and thus alignment).
Please use 1-3_autocorrect.py.
    def hamming(s1: str, s2: str) -> int:
        pass

    def levenshtein(s1: str, s2: str) -> (int, str):
        pass

For spelling correction, we will use prior knowledge to put some learning into our system.
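As a point of comparison, here is one possible sketch of the two distance functions. The transcript symbols `m`/`s`/`d`/`i` (match, substitute, delete, insert) are an assumed convention, not prescribed by the exercise:

```python
def hamming(s1: str, s2: str) -> int:
    """Hamming distance: number of positions at which the characters differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(s1, s2))

def levenshtein(s1: str, s2: str) -> (int, str):
    """Levenshtein distance plus edit transcript, via DP table and backtrace."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                       # delete everything from s1
    for j in range(1, n + 1):
        D[0][j] = j                       # insert everything from s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    # backtrace: m=match, s=substitute, d=delete (from s1), i=insert
    i, j, ops = m, n, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            ops.append("m" if s1[i - 1] == s2[j - 1] else "s")
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("d"); i -= 1
        else:
            ops.append("i"); j -= 1
    return D[m][n], "".join(reversed(ops))
```

The classic sanity check is `levenshtein("kitten", "sitting")`, which should yield distance 3.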
The underlying idea is the Noisy Channel Model, that is: The user intends to write a word w, but through some noise in the process, happens to type the word x.
The correct word can then be recovered as $\hat{w} = \arg\max_{w \in V} P(w \mid x) = \arg\max_{w \in V} P(x \mid w)\,P(w)$ (Bayes' rule; the normalizer $P(x)$ is the same for all candidates).
Note: use eg. `pandoc -s -o README.html README.md --mathjax` to get the math rendered properly...
- The candidates $V$ can be obtained from a vocabulary.
- The probability $P(w)$ of a word $w$ can be learned (counted) from data.
- The probability $P(x|w)$ is more complicated... It could be learned from data, but we could also use a heuristic that relates to the edit distance, e.g. rank by distance.
You may use Peter Norvig's count_1w.txt file (local mirror). Note that it may help to restrict to the first 10k words to get started.
    def suggest(w: str, dist, max_cand=5) -> list:
        pass

Please use 1-3_autocorrect.py.
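A minimal sketch of the ranking idea follows. Note two assumptions of mine: `suggest` takes the count table as an extra `counts` parameter (the stub above does not fix how the vocabulary is passed in), and candidates are ranked first by edit distance, then by unigram probability $P(w)$, i.e. the channel model is reduced to a distance heuristic:

```python
def load_counts(path="count_1w.txt", top=10_000):
    """Load a 'word<TAB>count' file (Norvig-style); keep only the first `top` words."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            w, c = line.split()
            counts[w] = int(c)
            if len(counts) >= top:
                break
    return counts

def suggest(w: str, dist, counts: dict, max_cand=5) -> list:
    """Rank vocabulary words by (edit distance to w, descending P(w))."""
    total = sum(counts.values())
    scored = ((dist(w, v), -counts[v] / total, v) for v in counts)
    return [v for _, _, v in sorted(scored)[:max_cand]]
```

Sorting tuples of `(distance, -probability, word)` gives the distance-first, frequency-second ranking without any custom comparator.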
- How would you modify the implementation so that it works fast for large lexica (eg. the full file above)?
- How would you modify the implementation so that it works "while typing" instead of at the end of a word?
- How do you handle unknown/new words?
In the parts 1 and 2 above, we applied uniform cost to all substitutions. This does not really make sense if you look at a keyboard: the QWERTY layout will favor certain substitutions (eg. a and s), while others are fairly unlikely (eg. a and k).
Implement the Needleman-Wunsch algorithm, which is very similar to the Levenshtein distance; however, instead of minimizing cost, it maximizes similarity.
For a good measure of similarity, implement a metric that computes a meaningful weight for a given character substitution.
How does your suggest behave using this metric?
    def nw(s1: str, s2: str, d: float, sim) -> float:
        pass

Please use 1-3_autocorrect.py.
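A sketch of the recurrence, following the stub's signature (`d` is the gap penalty, `sim` the substitution similarity). The default scoring of +1/-1 for match/mismatch is my own placeholder; the exercise asks you to replace it with a keyboard-aware metric:

```python
def nw(s1: str, s2: str, d: float = -1.0, sim=None) -> float:
    """Needleman-Wunsch: maximize alignment similarity.
    sim(a, b) defaults to +1 for a match, -1 for a mismatch (placeholder)."""
    if sim is None:
        sim = lambda a, b: 1.0 if a == b else -1.0
    m, n = len(s1), len(s2)
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * d                   # leading gaps in s2
    for j in range(1, n + 1):
        F[0][j] = j * d                   # leading gaps in s1
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + sim(s1[i - 1], s2[j - 1]),
                          F[i - 1][j] + d,        # gap in s2
                          F[i][j - 1] + d)        # gap in s1
    return F[m][n]
```

Compared to Levenshtein, the only structural changes are `max` instead of `min` and signed scores instead of non-negative costs.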
- What could be better heuristics for the similarity (or cost) of a substitution than the one above?
- What about capitalization, punctuation and special characters?
- What about swipe-to-type?
Due to the high sampling rate (eg. 8 kHz, i.e. 8000 samples per second), performing DTW on the raw audio signal is not advised (feel free to try!). A better solution is to compute a set of features; here we will extract mel-frequency cepstral coefficients (MFCCs) over windows of 25 ms length, shifted by 10 ms. The recommended implementation is librosa.
Please use 4_iwr.py.
Download Zohar Jackson's free spoken digit dataset.
There's no need to clone, feel free to use a revision, like v1.0.10.
File naming convention is trivial ({digitLabel}_{speakerName}_{index}.wav); let's restrict to two speakers, eg. jackson and george.
DTW is closely related to the Levenshtein distance and the Needleman-Wunsch algorithm. The main rationale behind DTW is that the two sequences can be aligned, but their speed and exact realization may vary. In consequence, the cost does not depend on an edit operation but on a difference between observations.
To verify your implementation, perform the following experiment: for each speaker and digit, select one recording as reference and one as test.
- Compute the DTW scores between each of the files.
- How do the values compare within and across speakers?
    def dtw(obs1: list, obs2: list, sim) -> float:
        pass

    def recognize(obs: list, refs: dict) -> str:
        """
        obs: input observations (mfcc)
        refs: dict of (classname, observations) as references
        returns classname where distance of observations is minimum
        """
        pass

- What are inherent issues of this approach?
- How does this algorithm scale with a larger vocabulary, how can it be improved?
- How can you extend this idea to continuous speech, ie. ?
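The DTW recurrence can be sketched as below. Since the text frames DTW in terms of a difference between observations, I use a distance rather than a similarity here, defaulting to the Euclidean distance between feature vectors (an assumption; the stub's `sim` argument leaves the measure open):

```python
import numpy as np

def dtw(obs1: list, obs2: list, dist=None) -> float:
    """DTW cost between two feature sequences (lists of vectors).
    dist defaults to the Euclidean distance between frames."""
    if dist is None:
        dist = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
    n, m = len(obs1), len(obs2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(obs1[i - 1], obs2[j - 1])
            # predecessors: stretch obs2, stretch obs1, or advance both
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def recognize(obs: list, refs: dict) -> str:
    """Return the class whose reference has minimal DTW cost to obs."""
    return min(refs, key=lambda c: dtw(obs, refs[c]))
```

Unlike the edit distances, moving diagonally still incurs the frame cost `c`; only the alignment is free to warp.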
Dual-tone multi-frequency DTMF signaling is an old way of transmitting dial pad keystrokes over the phone.
Each key/symbol is assigned a frequency pair: [(1,697,1209), (2,697,1336), (3,697,1477), (A,697,1633), (4,770,1209), (5,770,1336), (6,770,1477), (B,770,1633), (7,852,1209), (8,852,1336), (9,852,1477), (C,852,1633), (*,941,1209), (0,941,1336), (#,941,1477), (D,941,1633)].
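The assignment above can be written down directly as a lookup table; the name `DTMF` is my own:

```python
# key/symbol -> (low frequency in Hz, high frequency in Hz)
DTMF = {
    "1": (697, 1209), "2": (697, 1336), "3": (697, 1477), "A": (697, 1633),
    "4": (770, 1209), "5": (770, 1336), "6": (770, 1477), "B": (770, 1633),
    "7": (852, 1209), "8": (852, 1336), "9": (852, 1477), "C": (852, 1633),
    "*": (941, 1209), "0": (941, 1336), "#": (941, 1477), "D": (941, 1633),
}
```

Note the grid structure: four low frequencies index the rows of the keypad, four high frequencies the columns.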
You can generate some DTMF sequences online, eg. https://www.audiocheck.net/audiocheck_dtmf.php
For feature computation, use librosa to compute the power spectrum (librosa.stft and librosa.amplitude_to_db), and extract the approx. band energy for each relevant frequency.
Note: It's best to identify silence by the overall spectral energy and to normalize the band energies to sum up to one.
To decode DTMF sequences, we can use again dynamic programming, this time applied to states rather than edits.
For DTMF sequences, consider a small, fully connected graph that has 17 states: 0-9, A-D, *, # and silence.
As for the DP-matrix: the rows will denote the states and the columns represent the time; we will decode left-to-right (ie. time-synchronous), and at each time step, we will have to find the best step forward.
The main difference to edit distances or DTW is that you may now also "go up" in the table, ie. change state freely.
For distance/similarity, use a template vector for each state that has .5 for those two bins that need to be active.
When decoding a sequence, the idea is now that we remain in one state as long as the key is pressed; after that, the key may either be released (and the spectral energy is close to 0) hence we're in pause, or another key is pressed, hence it's a "direct" transition.
Thus, from the backtrack, collapse the sequence by digit and remove silence, eg. 44443-3-AAA becomes 433A.
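The collapsing step is a one-liner with `itertools.groupby`; here I assume silence is written as `-`, as in the example:

```python
from itertools import groupby

def collapse(states) -> str:
    """Collapse runs of repeated states and drop silence ('-'),
    e.g. '44443-3-AAA' -> '433A'."""
    return "".join(s for s, _ in groupby(states) if s != "-")
```

`groupby` yields one entry per run of equal consecutive states, so repeated frames of the same key reduce to a single symbol.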
    def decode(y: np.ndarray, sr: float) -> list:
        """y is input signal, sr is sample rate; returns list of DTMF-signals (no silence)"""
        pass

Please use 5_dtmf.py.
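The time-synchronous DP over states can be sketched as follows. One caveat worth noting: in a fully connected graph with completely free transitions, the best path degenerates to a per-frame argmax of the similarities, so I add a small `self_bonus` (my own assumption) that rewards staying in the same state and thus smooths over noisy frames:

```python
import numpy as np

def viterbi_decode(feats, templates, self_bonus=0.1):
    """Time-synchronous DP over a fully connected state graph.
    feats: (T, d) normalized band energies; templates: dict state -> (d,) vector
    (e.g. 0.5 in the two bins that should be active). Returns the best state
    sequence; self_bonus is an assumed small reward for self-transitions."""
    states = list(templates)
    tmpl = np.stack([templates[s] for s in states])        # (n_states, d)
    sims = np.asarray(feats) @ tmpl.T                      # (T, n_states)
    T, n = sims.shape
    back = np.zeros((T, n), dtype=int)
    score = sims[0].copy()
    for t in range(1, T):
        cand = score[:, None] + self_bonus * np.eye(n)     # predecessor scores
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + sims[t]
    j = int(score.argmax())                                # backtrack best path
    path = [states[j]]
    for t in range(T - 1, 0, -1):
        j = int(back[t, j])
        path.append(states[j])
    return path[::-1]
```

Feeding the resulting state sequence through the collapsing step described above yields the decoded key sequence.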


