This notebook helps choose values for `b` (the number of bands) and `r` (the number of minhashes per band) used by the fuzzy deduplication algorithm. The default values are `b=14` and `r=8`, as defined in the [FineWeb datasets paper](https://arxiv.org/pdf/2406.17557). The plot shows the function $1 - (1 - s^r)^b$: the x-axis is the Jaccard similarity `s` between a pair of documents, and the y-axis is the probability that the pair becomes a duplicate candidate. For more details on this methodology, see http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

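# Probability that a pair with Jaccard similarity s becomes a duplicate candidate
def f(s, r, b):
    # s**r: all r minhashes of one band match; 1 - (1 - s**r)**b: at least one of the b bands matches
    return 1 - (1 - s**r)**b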

# Set the parameters r and b
r = 8
b = 14

# Generate values for s in a range, e.g., from 0 to 1
s_values = np.linspace(0, 1, 500)  # 500 points between 0 and 1
f_values = f(s_values, r, b)

# Plot the function
plt.figure(figsize=(8, 6))
plt.plot(s_values, f_values, label=fr"$f(s) = 1 - (1 - s^{{{r}}})^{{{b}}}$", color='blue')
plt.xlabel("Jaccard similarity s")
plt.ylabel("Probability of becoming a duplicate candidate, f(s)")
plt.title(f"Plot of the function $f(s) = 1 - (1 - s^{{{r}}})^{{{b}}}$")
plt.legend()
plt.grid(True)
plt.show()
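
To read concrete numbers off the curve instead of estimating them from the plot, the same formula can be evaluated and inverted directly. The following is a minimal sketch, not part of the original notebook: the helper names `candidate_probability`, `approximate_threshold`, and `similarity_for_probability` are made up for illustration, and the threshold estimate `(1/b)^(1/r)` is the rule-of-thumb approximation given in the Mining of Massive Datasets chapter linked above.

```python
def candidate_probability(s, r, b):
    """Probability that a pair with Jaccard similarity s shares at least one band."""
    return 1 - (1 - s**r)**b

def approximate_threshold(r, b):
    """Rule-of-thumb similarity around which the S-curve rises steeply: (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)

def similarity_for_probability(p, r, b):
    """Invert the S-curve: the similarity at which the candidate probability equals p."""
    return (1 - (1 - p) ** (1 / b)) ** (1 / r)

# Default parameters from the notebook
r, b = 8, 14

print(f"approximate threshold: {approximate_threshold(r, b):.3f}")
print(f"similarity with 50% candidate probability: {similarity_for_probability(0.5, r, b):.3f}")

# Candidate probability at a few sample similarities (illustrative values only)
for s in (0.5, 0.7, 0.8, 0.9):
    print(f"s = {s:.1f} -> P(candidate) = {candidate_probability(s, r, b):.3f}")
```

Changing `r` and `b` here and in the plot cell shows how the trade-off moves: increasing `r` shifts the curve toward higher similarities, while increasing `b` shifts it toward lower ones.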
