Commit fe2bfaa

fix minhash
1 parent e18c4b0 commit fe2bfaa

1 file changed

Lines changed: 17 additions & 57 deletions

content/docs/Recsys/minhash.md

@@ -5,19 +5,16 @@ title: Implementing Minhash
55

66
# Chapter 3 of Mining Massive DataSets
77

8-
[Chapter](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)
8+
[Chapter](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) [Slides](http://www.mmds.org/mmds/v2.1/ch03-lsh.pdf)
99

10-
[Slides:](http://www.mmds.org/mmds/v2.1/ch03-lsh.pdf)
11-
12-
**Motivating the Chapter **
10+
**Motivating the Chapter**
1311

1412
We cover the first part of this chapter, which deals with Jaccard similarity, shingling, and minhash.
1513

1614
These days, data analysis often involves datasets with high dimensionality, meaning the dataset has more features than observations and requires large amounts of data to support statistically sound inferences at scale ([more info here, on p. 22](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)). It’s these kinds of datasets that Chapter 3 deals with.
1715

1816

19-
![minhash](https://raw.githubusercontent.com/veekaybee/boringml/master/static/images/minhash1.png)
20-
17+
![minhash](https://raw.githubusercontent.com/veekaybee/boringml/main/static/images/minhash1.png)
2118

2219

2320
What kinds of datasets have these features? In the world of recommendation systems, almost any kind of content we’d like to recommend, such as text and images, [will be a high-dimensional space.](https://towardsdatascience.com/understanding-high-dimensional-spaces-in-machine-learning-4c5c38930b6a) One important ability when dealing with these large datasets is finding similar items. In fact, this is what underlies recommendations: how similar is this post to that post, and how can we recommend things that are similar to other things we know for sure the user likes?
@@ -26,36 +23,27 @@ Before we recommend something, we have to see whether two items are similar. I p
2623

2724
**Jaccard Similarity**
2825

29-
This chapter of MMDS specifically deals with [minhash](http://v), one method of combing through millions of items and evaluating how similar they are based on some definition of “distance” between two sets, or groups of items. It was a technique initially used to dedup search results for AltaVista. It can also be used for recommending similar images or detecting plagiarism. It’s mostly used in the context of comparing groups of (text-based) documents. [Something that’s important to keep in mind](https://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/) is that we’re not actually looking for the meaning of the sets, just whether these documents are similar on a purely textual level.
26+
This chapter of MMDS specifically deals with minhash, one method of combing through millions of items and evaluating how similar they are based on some definition of “distance” between two sets, or groups of items. It was a technique initially used to dedup search results for AltaVista. It can also be used for recommending similar images or detecting plagiarism. It’s mostly used in the context of comparing groups of (text-based) documents. [Something that’s important to keep in mind](https://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/) is that we’re not actually looking for the meaning of the sets, just whether these documents are similar on a purely textual level.
3027

3128
The chapter starts out by introducing Jaccard similarity, a metric we can use to determine how similar any two sets of items are, where “set” is meant in the mathematical sense.
3229

3330
Jaccard similarity measures how similar two sets are as the size of their intersection divided by the size of their union, or SIM(S,T) = |S ∩ T| / |S ∪ T|. So imagine the two sets are collections of documents: they could be two web pages and the sentences on them, or two directories full of pictures, etc.
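As a rough illustration of the formula (a minimal sketch, not the code from the linked gist), the definition maps directly onto Python's built-in set operations. The function name and the sample "pages" are made up for the example, and returning 0.0 for two empty sets is an assumed convention.

```python
def jaccard_similarity(s: set, t: set) -> float:
    """Return |S ∩ T| / |S ∪ T| for two sets."""
    union = s | t
    if not union:
        # Both sets are empty: the ratio is undefined, so this sketch
        # returns 0.0 by (assumed) convention.
        return 0.0
    return len(s & t) / len(union)


# Two hypothetical pages, each represented as a set of sentences.
page_a = {"the cat sat", "on the mat", "and purred"}
page_b = {"the cat sat", "on the rug", "and purred"}

# 2 shared sentences out of 4 distinct sentences overall.
print(jaccard_similarity(page_a, page_b))  # 0.5
```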
3431

3532

36-
![minhash2](https://raw.githubusercontent.com/veekaybee/boringml/master/static/images/minhash2.png)
33+
![minhash2](https://raw.githubusercontent.com/veekaybee/boringml/main/static/images/minhash2.png)
3734

3835

3936
Here’s the implementation of Jaccard similarity in [Python and Scala](https://gist.github.com/veekaybee/f31274222ce85f7005b29f78df3de34d).
4037

41-
This is only good for one set of items, though, and doesn’t scale well if the sets are quite large. So we use minhashing, which gets close enough to approximating Jaccard similarity that we can say with confidence that two sets are alike or not alike.
42-
43-
As the [minhash paper says](http://cs.brown.edu/courses/cs253/papers/nearduplicate.pdf),
44-
45-
46-
However, for efficient large scale web indexing it is not necessary to de-
47-
38+
{{< gist veekaybee f31274222ce85f7005b29f78df3de34d >}}
4839

49-
termine the actual resemblance value: it suffices to determine whether
50-
51-
52-
newly encountered documents are duplicates or near-duplicates of docu-
5340

41+
This only works for one pair of sets at a time, though, and doesn’t scale well if the sets are quite large. So we use minhashing, which approximates Jaccard similarity closely enough that we can say with confidence whether two sets are alike or not.
5442

55-
ments already indexed. In other words, it suffices to determine whether
43+
As the [minhash paper says](http://cs.brown.edu/courses/cs253/papers/nearduplicate.pdf),
5644

5745

58-
the resemblance is above a certain threshold.
46+
However, for efficient large scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold.
5947

6048
If this sounds familiar to you, it may be because you’re already familiar with datasketches, the family of probabilistic data structures that create quick glances at a large amount of data and tell you with some degree of certainty that items are the same or not the same, or can introspect a set for certain properties. (Bloom filters and HyperLogLog are examples.)
6149

@@ -65,10 +53,7 @@ So, in order to compare documents, we need to create sets of them that we can fi
6553

6654
Here’s a good representation of how this works, [from this course](https://www.cs.utah.edu/~jeffp/teaching/cs5955/L5-Minhash.pdf). Imagine each set is a single piece of paper with several numbers on it:
6755

68-
<p id="gdcalert3" ><span style="color: red; font-weight: bold">>>>>> gd2md-html alert: inline image link here (to images/image3.png). Store image on your image server and adjust path/filename/extension if necessary. </span><br>(<a href="#">Back to top</a>)(<a href="#gdcalert4">Next alert</a>)<br><span style="color: red; font-weight: bold">>>>>> </span></p>
69-
70-
71-
![minhash3](https://raw.githubusercontent.com/veekaybee/boringml/master/static/images/minhash3.png)
56+
![minhash3](https://raw.githubusercontent.com/veekaybee/boringml/main/static/images/minhash3.png)
7257

7358

7459
For word-based documents, though, we need to get at the letter representations of a set. So, instead of using individual numbers, we use shingles, which are really just short strings of any number of letters.
@@ -79,13 +64,11 @@ any substring of length k found within the document. So, Suppose our document D
7964

8065
K can be almost any number we want, but at some point there will be an optimal value: one where we don’t get a sparse matrix that’s too large to compute, or a matrix so small that the similarity between sets is too high. In this way, picking K is similar to picking K for clustering algorithms, where you use the highly scientific method of the [elbow method until it looks right.](https://en.wikipedia.org/wiki/Elbow_method_(clustering))
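To make shingling concrete, here is a small sketch that collects every substring of length k from a string. The function name and the input string are illustrative choices rather than anything prescribed by the chapter.

```python
def shingles(text: str, k: int) -> set:
    """Return the set of all length-k substrings (k-shingles) of text."""
    # A string of length n has n - k + 1 windows of width k;
    # using a set collapses repeated shingles.
    return {text[i:i + k] for i in range(len(text) - k + 1)}


# "ab" occurs twice in the input but appears once in the set.
print(sorted(shingles("abcdabd", 2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
```

A larger k produces more distinct shingles per document and lower similarity between documents, which is the trade-off described above.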
8166

82-
**Minhash **
67+
**Minhash**
8368

8469
Once we have K, we can set up the matrix. And now we minhash. Here, for example, k is 1.
8570

86-
87-
88-
![minhash3](https://raw.githubusercontent.com/veekaybee/boringml/master/static/images/minhash4.png)
71+
![minhash4](https://raw.githubusercontent.com/veekaybee/boringml/main/static/images/minhash4.png)
8972

9073

9174

@@ -94,46 +77,23 @@ First, we pick a permutation of rows. What this means in English is that we just
9477
Then, we create the new matrix, indicating whether each letter is in each set in the same position. For example, if we mix up the letters like this, the rows are b, e, a, d, and c instead of in alphabetical order.
9578

9679

97-
98-
<p id="gdcalert5" ><span style="color: red; font-weight: bold">>>>>> gd2md-html alert: inline image link here (to images/image5.png). Store image on your image server and adjust path/filename/extension if necessary. </span><br>(<a href="#">Back to top</a>)(<a href="#gdcalert6">Next alert</a>)<br><span style="color: red; font-weight: bold">>>>>> </span></p>
99-
100-
101-
![minhash5](https://raw.githubusercontent.com/veekaybee/boringml/master/static/images/minhash5.png)
80+
![minhash5](https://raw.githubusercontent.com/veekaybee/boringml/main/static/images/minhash5.png)
10281

10382

10483
And we can see, from the first diagram, that for S1, b is 0, so the value of that index is 0, and so on.
10584

10685
Now for the actual minhash function, we keep going down each column, in the permuted row order, until we hit the first 1 value. For S1, that value is in row a:
10786

108-
column, which is the column for set S1, has 0 in row b, so we proceed to row e,
109-
110-
the second in the permuted order. There is again a 0 in the column for S1, so
111-
112-
we proceed to row a, where we find a 1. Thus. h(S1) = a. And likewise, we see
113-
114-
that h(S2) = c, h(S3) = b, and h(S4) = a.
87+
The first column, which is the column for set `S1`, has 0 in row b, so we proceed to row e, the second in the permuted order. There is again a 0 in the column for `S1`, so we proceed to row a, where we find a 1. Thus, `h(S1) = a`. And likewise, we see that `h(S2) = c`, `h(S3) = b`, and `h(S4) = a`.
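The walk down each column can be sketched in a few lines of Python. The set memberships below are an assumption: they are taken from the MMDS example matrix, which the diagrams appear to reproduce, and the function name is illustrative.

```python
def minhash(members: set, permuted_rows: list):
    """Return the first row, in permuted order, that the set contains."""
    for row in permuted_rows:
        if row in members:  # equivalent to finding a 1 in this column
            return row
    return None  # an empty set has no 1 in any row


# Assumed set contents, following the MMDS example matrix.
sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}
order = ["b", "e", "a", "d", "c"]  # the permuted row order from the text

for name, members in sets.items():
    print(name, minhash(members, order))
```

Under these assumptions the loop prints a, c, b, and a for S1 through S4, matching the values quoted above.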
11588

11689
Now we get to the connection between minhash and Jaccard similarity:
11790

118-
119-
The probability that the minhash function for a random permutation of
120-
121-
122-
rows produces the same value for two sets equals the Jaccard similarity
123-
124-
125-
of those sets
91+
The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
12692

12793
This is really important, because it allows us to use minhash values to estimate Jaccard similarity, instead of manually computing it across all the rows and columns of a very large document matrix. And,
12894

12995

130-
Moreover, the more minhashings we use, i.e., the more rows in the sig-
131-
132-
133-
nature matrix, the smaller the expected error in the estimate of the Jaccard
134-
135-
136-
similarity will be
96+
Moreover, the more minhashings we use, i.e., the more rows in the signature matrix, the smaller the expected error in the estimate of the Jaccard similarity will be.
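Both claims can be checked empirically with a small simulation: shuffle the rows many times, count how often two sets get the same minhash value, and compare that fraction with the true Jaccard similarity. This is only a sketch; the sets, the seed, and the helper names are assumptions made for illustration.

```python
import random


def minhash(members, permuted_rows):
    """Return the first row, in permuted order, that the set contains."""
    for row in permuted_rows:
        if row in members:
            return row
    return None


def estimate_jaccard(s, t, rows, n, seed=0):
    """Fraction of n random row permutations on which s and t share a minhash."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    rows = list(rows)
    matches = 0
    for _ in range(n):
        rng.shuffle(rows)  # one random permutation of the rows
        if minhash(s, rows) == minhash(t, rows):
            matches += 1
    return matches / n


rows = "abcde"
s1, s4 = {"a", "d"}, {"a", "c", "d"}  # true Jaccard similarity is 2/3

# More permutations should, on average, land closer to 2/3.
for n in (10, 100, 10_000):
    print(n, estimate_jaccard(s1, s4, rows, n))
```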
13797

13898
Here’s a good explanation of how this [works with a little more detail than MMDS.](https://www.cs.utah.edu/~jeffp/teaching/cs5955/L5-Minhash.pdf)
13999

@@ -145,7 +105,7 @@ a random hash function that maps row numbers to as many buckets as there
145105

146106
are rows. Thus, instead of picking n random permutations of rows, we pick n randomly chosen hash functions h1, h2, . . . , hn on the rows. We construct the signature matrix by considering each row in their given order, and then we look across the rows.
147107

148-
![minhash5](https://raw.githubusercontent.com/veekaybee/boringml/master/static/images/minhash6.png)
108+
![minhash6](https://raw.githubusercontent.com/veekaybee/boringml/main/static/images/minhash6.png)
149109

150110

151111
You can have as many hash functions as you want; each one generates a number for every row. For each set, take the minimum value a single hash function produces over the rows that set contains, repeat across all the hash functions you have, and you’ll get a signature for that set, which you can then compare across document sets.
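Here is one way this could look in Python, as a sketch under stated assumptions. The hash family `(a * row + b) % PRIME`, the choice of prime, the number of hash functions, and all function names are illustrative, and linear hashes like these only approximate true random permutations, so the estimate carries a small bias on top of sampling error.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1, assumed to exceed every row number


def make_hash_functions(n, seed=0):
    """Pick n random (a, b) pairs, each defining h(row) = (a * row + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def signature(rows_with_ones, hash_functions):
    """One entry per hash function: the minimum hash over the set's rows.

    rows_with_ones must be a non-empty set of row numbers.
    """
    return [min((a * r + b) % PRIME for r in rows_with_ones)
            for a, b in hash_functions]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


hashes = make_hash_functions(500)
s1 = {0, 3}     # rows a and d, numbered 0..4 in alphabetical order
s4 = {0, 2, 3}  # rows a, c, and d; true Jaccard similarity is 2/3

print(estimate_similarity(signature(s1, hashes), signature(s4, hashes)))
```

Each signature has a fixed length, one entry per hash function, no matter how large the underlying set is, which is what makes comparing many documents tractable.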
