Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] [classlib] add fuzzy array comparisons #4468

Open
wants to merge 4 commits into
base: develop
from

Conversation

Projects
None yet
3 participants
@jrsurge
Copy link
Member

commented Jun 27, 2019

Purpose and Motivation

To add fuzzy string comparisons.

EDIT: this PR has changed focus to providing more general, fuzzy Array comparisons

Initial motivation came from the work on choosing an audio device for Windows, I started thinking if we could work out the most similar string for a device name to make guessing a little smarter, accounting for typos etc.

The core of this is fuzzy string comparison, so I added it to the language.

It's very easy to take these new methods, and write a quick check for nearest string, comparing each word to find the most relevant match (or if agreed, add the methods below for this).

For this to apply to the device selection, we'd need a way to return an array of device name strings on all platforms, but fuzzy string comparison itself could prove useful to the language even without this, so here it is.

This PR adds the following instance methods to the String SequenceableCollection class:

  • editDistance
    Alternatively known as Levenshtein Distance, this is the minimum amount of changes required to convert one string into another.
  • similarity
    This uses the editDistance to calculate the percentage similarity (scaled between 0-1) of the two strings (0 is completely different, 1 is exactly the same)
    - isSimilar
    This uses the similarity, and the given delta to return a bool if the two strings are at least as similar as the delta value.

Examples

"hello".editDistance("hallo"); // 1
"word".similarity("wodr"); // 0.5

### Additional work

If possible, I'd also like to add a numSimilarWords and nearestString methods, as follows:
EDIT: commented out

Types of changes

  • Documentation
  • New feature

To-do list

  • Code is tested
  • All tests are passing
  • Updated documentation
  • This PR is ready for review
    - [ ] Add additional methods?

@jrsurge jrsurge changed the title Topic/sclang add fuzzy string comparisons [classlib] add fuzzy string comparisons Jun 27, 2019

@jrsurge jrsurge force-pushed the jrsurge:topic/sclang-add-fuzzy-string-comparisons branch 2 times, most recently from 976cc09 to 4f69842 Jun 27, 2019

@muellmusik

This comment has been minimized.

Copy link
Contributor

commented Jun 27, 2019

@jrsurge jrsurge force-pushed the jrsurge:topic/sclang-add-fuzzy-string-comparisons branch 3 times, most recently from d59fe1f to 62767fe Jun 27, 2019

@brianlheim brianlheim self-requested a review Jun 27, 2019

@brianlheim

This comment has been minimized.

Copy link
Member

commented Jun 27, 2019

This is great @jrsurge!

Just to make this clear first (so the following makes more sense), in our current implementation Strings are just arrays of bytes. Multi-byte character codes from UTF-8 are considered as multiple separate bytes, and so the edit distance between "你好" and "" is 4, not 2.


With that in mind, a couple thoughts:

I think the idea of "words" is not well-defined enough for us to use it in the core library. Depending on your purpose and language, what delimits a word could change drastically. Here, you're just splitting by space; what about newlines and other whitespace? Is "my.device" considered one word or two? What about "cat's" vs "cat‘s", "cat-dog" vs "cat--dog" vs "cat–dog" (en-dash)? Should you include surrounding punctuation when you split? I would expect the words in "this... is(a dog)" to be [this, is, a, dog], while under the current implementation it's [this..., is(a, dog)]. Then there is also a bit of inherent ethnocentrism in that this method works best if your string is all characters from the ASCII set and in a language where words are delimited by spaces.

Some of these points can't be fully addressed until we have the ability to iterate through strings as sequences of Unicode code points. Even then, I'd be skeptical, because there just isn't a definition that will work for most or even all situations when it comes down to the corner cases. I would rather not have our core library take a position on that, OR offer a watered down version of it whose name reflects the more salient biases/assumptions it makes.

So, I'm a no on numSimilarWords and nearestString. I think these should either (1) be written in the place they're needed with a very clear notice about the assumptions made, or (2) be reworked so they are more generic (although currently I'm skeptical they can be).

I could see a case for a less opinionated version of nearestString that just selects the argument with the lowest edit distance. In fact, that was what I expected when I first saw the name. And this is likely to be a common usage, since the primary usage I see for edit distance is when you need to pick which of the things in a list the user probably meant to type.

Note that this will still choose the right device in your example:

(
var nearestString = { |s, l| l[l.maxIndex { |x| x.similarity(s) }] }; // sorry :<
nearestString.("awesoem", [
	"Some Device That's Not Great",
	"Some Good Device That's Not What We Want",
	"Another Amazing Device Name",
	"My Awesome Device Name"
]) // ->  My Awesome Device Name
)

I'm also a little skeptical on the utility of isSimilar. The first line is clearer to me, and almost the same number of characters:

"abc".similarity("def") > 0.7
"abc".isSimilar("def", 0.7)

You lose a bit of semantic value when you hide the comparison behind the function interface, and the name isSimilar doesn't give much of a hint about what the second argument means.


I think we should supply these for the most abstract class possible, SequenceableCollection. I can copy and paste your exact definitions for editDistance and similarity there, and they work just fine. :)


Finally, Levenshtein Distance is IMO worth making a C++ function and primitive. This could be super useful in the sclang compiler, and in IDE code, so it'd be nice to have a single (fast) implementation. In addition, this is a function I can see people wanting to potentially call on very large strings; it'd be nice if this weren't the bottleneck in case people want to, for instance, compute the edit distance between two small audio files. If that's done, we may need one primitive for SequenceableCollection and a separate one for ArrayedCollection. What do you think?

@brianlheim
Copy link
Member

left a comment

See comment above

@jrsurge

This comment has been minimized.

Copy link
Member Author

commented Jun 27, 2019

So, I'm a no on numSimilarWords and nearestString. I think these should either (1) be written in the place they're needed with a very clear notice about the assumptions made, or (2) be reworked so they are more generic (although currently I'm skeptical they can be).

Yup, this is exactly why I didn't include it in the initial PR - they didn't feel general enough.

I could see a case for a less opinionated version of nearestString that just selects the argument with the lowest edit distance. In fact, that was what I expected when I first saw the name. And this is likely to be a common usage, since the primary usage I see for edit distance is when you need to pick which of the things in a list the user probably meant to type.

Note that this will still choose the right device in your example:

(
var nearestString = { |s, l| l[l.maxIndex { |x| x.similarity(s) }] }; // sorry :<
nearestString.("awesoem", [
	"Some Device That's Not Great",
	"Some Good Device That's Not What We Want",
	"Another Amazing Device Name",
	"My Awesome Device Name"
]) // ->  My Awesome Device Name
)

In this example, that's true. But I noticed it gets massively skewed with mixtures of long and short strings. With a short search string, and an array with longer strings and a short string, checking with the entire string, instead of 'words', becomes problematic. It will almost always choose the shortest string because it involves the least manipulation.
But, it should be more general, I definitely agree.

I'm also a little skeptical on the utility of isSimilar. The first line is clearer to me, and almost the same number of characters:

"abc".similarity("def") > 0.7
"abc".isSimilar("def", 0.7)

You lose a bit of semantic value when you hide the comparison behind the function interface, and the name isSimilar doesn't give much of a hint about what the second argument means.

This is fair. I liked it because it felt more concise when checking similarity, but it really is just a wrapper around "abc".similarity("def") > 0.7

I think we should supply these for the most abstract class possible, SequenceableCollection. I can copy and paste your exact definitions for editDistance and similarity there, and they work just fine. :)

Somewhat terrifying, I hadn't really thought about this given the expected use case, but that's super cool.

Finally, Levenshtein Distance is IMO worth making a C++ function and primitive. This could be super useful in the sclang compiler, and in IDE code, so it'd be nice to have a single (fast) implementation. In addition, this is a function I can see people wanting to potentially call on very large strings; it'd be nice if this weren't the bottleneck in case people want to, for instance, compute the edit distance between two small audio files. If that's done, we may need one primitive for SequenceableCollection and a separate one for ArrayedCollection. What do you think?

I thought about this, but on checking the speed in sclang, I was pretty impressed. I didn't really test anything massive though.
For massive strings, or moving to SequenceableCollection/ArrayedCollection, this might be more of an issue, so a primitive might be better in the long run.

I am, however, somewhat unsure how to write one.
If that's the direction we'd prefer, I'll go and consult the book?

@brianlheim

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

In this example, that's true. But I noticed it gets massively skewed with mixtures of long and short strings. With a short search string, and an array with longer strings and a short string, checking with the entire string, instead of 'words', becomes problematic. It will almost always choose the shortest string because it involves the least manipulation.
But, it should be more general, I definitely agree.

Ok, yes I can see that. But, what exactly is the goal here? IIUC, the goal is to make it easy to choose something from a list by typing enough characters to disambiguate the element you want. If that's the case, I have some ideas for other algorithms that might be better.

I thought about this, but on checking the speed in sclang, I was pretty impressed. I didn't really test anything massive though.
For massive strings, or moving to SequenceableCollection/ArrayedCollection, this might be more of an issue, so a primitive might be better in the long run.

I definitely think so. Here's a benchmark. Note that we probably can't achieve exactly these speeds in c++, but for raw array types we should be able to come close. For object types it will of course be a lot slower even in C++ because of the need to invoke == on the objects rather than being able to compare the values directly.

In sclang, using your implementation, release build:

(
Array.geom(12, 4, 2).do { |n|
	n.post;
	' '.post;
	bench { String.fill(n, $a).editDistance(String.fill(n, $b)) }
}
)

/*
4
time to run: 7.4634999918999e-05 seconds.
8
time to run: 0.00019521800004441 seconds.
16
time to run: 0.00066588400000001 seconds.
32
time to run: 0.0023637589999908 seconds.
64
time to run: 0.0062601669999367 seconds.
128
time to run: 0.030553861000044 seconds.
256
time to run: 0.11677848900013 seconds.
512
time to run: 0.443587944 seconds.
1024
time to run: 1.2803825400001 seconds.
2048
time to run: 4.8449122709999 seconds.
4096
time to run: 19.728719459 seconds.
8192
time to run: 77.879591756 seconds.
*/

in C++, using the Rosetta code implementation:

[brianheim@BrianMBP lsd]$ for i in `seq 2 16`; do echo $((2 ** $i)); time ./main $((2 ** $i)); done
4
distance: 4

real	0m0.004s
user	0m0.001s
sys	0m0.002s
8
distance: 8

real	0m0.003s
user	0m0.001s
sys	0m0.001s
16
distance: 16

real	0m0.003s
user	0m0.001s
sys	0m0.001s
32
distance: 32

real	0m0.003s
user	0m0.001s
sys	0m0.001s
64
distance: 64

real	0m0.003s
user	0m0.001s
sys	0m0.001s
128
distance: 128

real	0m0.003s
user	0m0.001s
sys	0m0.001s
256
distance: 256

real	0m0.003s
user	0m0.001s
sys	0m0.002s
512
distance: 512

real	0m0.003s
user	0m0.001s
sys	0m0.001s
1024
distance: 1024

real	0m0.005s
user	0m0.003s
sys	0m0.001s
2048
distance: 2048

real	0m0.009s
user	0m0.007s
sys	0m0.002s
4096
distance: 4096

real	0m0.025s
user	0m0.023s
sys	0m0.001s
8192
distance: 8192

real	0m0.088s
user	0m0.086s
sys	0m0.002s
16384
distance: 16384

real	0m0.337s
user	0m0.334s
sys	0m0.002s
32768
distance: 32768

real	0m1.345s
user	0m1.341s
sys	0m0.003s
65536
distance: 65536

real	0m5.373s
user	0m5.366s
sys	0m0.005s

Note that 32K in C++ takes about the same time as it does for 1K in sclang.

There are also different ways of computing Levenshtein difference that could be useful in different situations (https://en.wikipedia.org/wiki/Levenshtein_distance#Computing_Levenshtein_distance). There's one that uses a reduced memory footprint (O(n) instead of O(n^2)) and also an approximation algorithm. For very large arrays that's probably best. However, there's no guarantee those are going into the library unless someone is motivated to write them, so if we just have this one implementation, it would be great to have one that is as flexible and powerful as possible.

I am, however, somewhat unsure how to write one.
If that's the direction we'd prefer, I'll go and consult the book?

N.B. - James and I have been discussing this in private

@jrsurge jrsurge changed the title [classlib] add fuzzy string comparisons [WIP] [classlib] add fuzzy string comparisons Jun 30, 2019

@jrsurge

This comment has been minimized.

Copy link
Member Author

commented Jun 30, 2019

Thanks to Brian for all his time and help with this.

For anyone interested, I have a working branch with an initial outline of the primitive version.
Currently, it can't compare objects (any advice on that appreciated), but everything else should work.

https://github.com/jrsurge/supercollider/tree/topic/sclang-add-fuzzy-string-comparisons-working

jrsurge added some commits Jun 27, 2019

[classlib] add fuzzy comparisons to SequenceableCollection
this commit adds the following methods to allow fuzzy comparisons:
- editDistance
- similarity

ArrayedCollection uses a primitive for speed, other SequenceableCollections
use an sclang implementation
[sclang] add ArrayLevenshteinDistance primitive
This adds:
- SC_Levenshtein.h
Header-only, templated functor for calculating Levenshtein distances
between arrays.
- _ArrayLevenshteinDistance
Primitive for sclang to calculate edit distances
Checks if homogeneous array, if it is, uses a faster comparison, if it isn't, uses slot-by-slot comparison.
[unittest] add unit tests for fuzzy comparisons
Adds tests to TestArray to test primitive
[docs] add help for fuzzy comparisons.
- SequenceableCollection
- String (a note, referencing SequenceableCollection help)

@jrsurge jrsurge force-pushed the jrsurge:topic/sclang-add-fuzzy-string-comparisons branch from 62767fe to 04b5ec8 Jul 2, 2019

@jrsurge

This comment has been minimized.

Copy link
Member Author

commented Jul 2, 2019

Based on Brian's suggestions, this is now implemented as a primitive.

The codebase now has SC_Levenshtein.h which is a header-only templated functor to calculate the distance between two arrays.

As is currently stands, all SequenceableCollections have access to editDistance and similarity.

ArrayedCollections (Array, String, RawArray etc.) use the primitive to calculate editDistance, but all other SequenceableCollections use the sclang-based implementation.

Any thoughts greatly appreciated.

@jrsurge jrsurge changed the title [WIP] [classlib] add fuzzy string comparisons [WIP] [classlib] add fuzzy array comparisons Jul 2, 2019

@brianlheim brianlheim self-requested a review Jul 9, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.