From 2bae25df677e809f3dec9f32bfc132c5d51ead95 Mon Sep 17 00:00:00 2001 From: Andy Stark Date: Tue, 24 Jun 2025 10:28:59 +0100 Subject: [PATCH 1/3] DOC-5227 added Jedis probabilistic data type examples --- content/develop/clients/jedis/prob.md | 223 ++++++++++++++++++++++++++ 1 file changed, 223 insertions(+) create mode 100644 content/develop/clients/jedis/prob.md diff --git a/content/develop/clients/jedis/prob.md b/content/develop/clients/jedis/prob.md new file mode 100644 index 0000000000..c426875e93 --- /dev/null +++ b/content/develop/clients/jedis/prob.md @@ -0,0 +1,223 @@ +--- +categories: +- docs +- develop +- stack +- oss +- rs +- rc +- oss +- kubernetes +- clients +description: Learn how to use approximate calculations with Redis. +linkTitle: Probabilistic data types +title: Probabilistic data types +weight: 5 +--- + +Redis supports several +[probabilistic data types]({{< relref "/develop/data-types/probabilistic" >}}) +that let you calculate values approximately rather than exactly. +The types fall into two basic categories: + +- [Set operations](#set-operations): These types let you calculate (approximately) + the number of items in a set of distinct values, and whether or not a given value is + a member of a set. +- [Statistics](#statistics): These types give you an approximation of + statistics such as the quantiles, ranks, and frequencies of numeric data points in + a list. + +To see why these approximate calculations would be useful, consider the task of +counting the number of distinct IP addresses that access a website in one day. + +Assuming that you already have code that supplies you with each IP +address as a string, you could record the addresses in Redis using +a [set]({{< relref "/develop/data-types/sets" >}}): + +```java +jedis.sadd("ip_tracker", new_ip_address); +``` + +The set can only contain each member once, so if the same address +appears again during the day, the new instance will not change +the set. 
At the end of the day, you could get the exact number of +distinct addresses using the `scard()` function: + +```java +long num_distinct_ips = jedis.scard("ip_tracker"); +``` + +This approach is simple, effective, and precise, but if your website +is very busy, the `ip_tracker` set could become very large and consume +a lot of memory. + +You would probably round the count of distinct IP addresses to the +nearest thousand or more to deliver the usage statistics, so +getting it exactly right is not important. It would be useful +if you could trade off some accuracy in exchange for lower memory +consumption. The probabilistic data types provide exactly this kind of +trade-off. Specifically, you can count the approximate number of items in a +set using the [HyperLogLog](#set-cardinality) data type, as described below. + +In general, the probabilistic data types let you perform approximations with a +bounded degree of error that have much lower memory consumption or execution +time than the equivalent precise calculations. + +## Set operations + +Redis supports the following approximate set operations: + +- [Membership](#set-membership): The + [Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and + [Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}}) + data types let you track whether or not a given item is a member of a set. +- [Cardinality](#set-cardinality): The + [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}}) + data type gives you an approximate value for the number of items in a set, also + known as the *cardinality* of the set. + +The sections below describe these operations in more detail. 
+ +### Set membership + +[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and +[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}}) +objects provide a set membership operation that lets you track whether or not a +particular item has been added to a set. These two types provide different +trade-offs for memory usage and speed, so you can select the best one for your +use case. Note that for both types, there is an asymmetry between presence and +absence of items in the set. If an item is reported as absent, then it is definitely +absent, but if it is reported as present, then there is a small chance it may really be +absent. + +Instead of storing strings directly, like a [set]({{< relref "/develop/data-types/sets" >}}), +a Bloom filter records the presence or absence of the +[hash value](https://en.wikipedia.org/wiki/Hash_function) of a string. +This gives a very compact representation of the +set's membership with a fixed memory size, regardless of how many items you +add. The following example adds some names to a Bloom filter representing +a list of users and checks for the presence or absence of users in the list. + +{{< clients-example home_prob_dts bloom Java-Sync >}} +{{< /clients-example >}} + +A Cuckoo filter has similar features to a Bloom filter, but also supports +a deletion operation to remove hashes from a set, as shown in the example +below. + +{{< clients-example home_prob_dts cuckoo Java-Sync >}} +{{< /clients-example >}} + +Which of these two data types you choose depends on your use case. +Bloom filters are generally faster than Cuckoo filters when adding new items, +and also have better memory usage. Cuckoo filters are generally faster +at checking membership and also support the delete operation. 
See the +[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and +[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}}) +reference pages for more information and comparison between the two types. + +### Set cardinality + +A [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}}) +object calculates the cardinality of a set. As you add +items, the HyperLogLog tracks the number of distinct set members but +doesn't let you retrieve them or query which items have been added. +You can also merge two or more HyperLogLogs to find the cardinality of the +[union](https://en.wikipedia.org/wiki/Union_(set_theory)) of the sets they +represent. + +{{< clients-example home_prob_dts hyperloglog Java-Sync >}} +{{< /clients-example >}} + +The main benefit that HyperLogLogs offer is their very low +memory usage. They can count up to 2^64 items with less than +1% standard error using a maximum 12KB of memory. This makes +them very useful for counting things like the total of distinct +IP addresses that access a website or the total of distinct +bank card numbers that make purchases within a day. + +## Statistics + +Redis supports several approximate statistical calculations +on numeric data sets: + +- [Frequency](#frequency): The + [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}}) + data type lets you find the approximate frequency of a labeled item in a data stream. +- [Quantiles](#quantiles): The + [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}}) + data type estimates the quantile of a query value in a data stream. +- [Ranking](#ranking): The + [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) data type + estimates the ranking of labeled items by frequency in a data stream. + +The sections below describe these operations in more detail. 
+ +### Frequency + +A [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}}) +(CMS) object keeps count of a set of related items represented by +string labels. The count is approximate, but you can specify +how close you want to keep the count to the true value (as a fraction) +and the acceptable probability of failing to keep it in this +desired range. For example, you can request that the count should +stay within 0.1% of the true value and have a 0.05% probability +of going outside this limit. The example below shows how to create +a Count-min sketch object, add data to it, and then query it. + +{{< clients-example home_prob_dts cms Java-Sync >}} +{{< /clients-example >}} + +The advantage of using a CMS over keeping an exact count with a +[sorted set]({{< relref "/develop/data-types/sorted-sets" >}}) +is that a CMS has very low and fixed memory usage, even for +large numbers of items. Use CMS objects to keep daily counts of +items sold, accesses to individual web pages on your site, and +other similar statistics. + +### Quantiles + +A [quantile](https://en.wikipedia.org/wiki/Quantile) is the value +below which a certain fraction of samples lie. For example, with +a set of measurements of people's heights, the quantile of 0.75 is +the value of height below which 75% of all people's heights lie. +[Percentiles](https://en.wikipedia.org/wiki/Percentile) are equivalent +to quantiles, except that the fraction is expressed as a percentage. + +A [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}}) +object can estimate quantiles from a set of values added to it +without having to store each value in the set explicitly. This can +save a lot of memory when you have a large number of samples. 
+ +The example below shows how to add data samples to a t-digest +object and obtain some basic statistics, such as the minimum and +maximum values, the quantile of 0.75, and the +[cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) +(CDF), which is effectively the inverse of the quantile function. It also +shows how to merge two or more t-digest objects to query the combined +data set. + +{{< clients-example home_prob_dts tdigest Java-Sync >}} +{{< /clients-example >}} + +A t-digest object also supports several other related commands, such +as querying by rank. See the +[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}}) +reference for more information. + +### Ranking + +A [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) +object estimates the rankings of different labeled items in a data +stream according to frequency. For example, you could use this to +track the top ten most frequently-accessed pages on a website, or the +top five most popular items sold. + +The example below adds several different items to a Top-K object +that tracks the top three items (this is the second parameter to +the `topkReserve()` method). It also shows how to list the +top *k* items and query whether or not a given item is in the +list. 
+ +{{< clients-example home_prob_dts topk Java-Sync >}} +{{< /clients-example >}} From c6d91ea4ce2103c9f3f2a41151a1c99510ff9a9a Mon Sep 17 00:00:00 2001 From: Andy Stark Date: Tue, 24 Jun 2025 10:59:01 +0100 Subject: [PATCH 2/3] DOC-5227 added inline examples --- content/develop/clients/jedis/prob.md | 185 ++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 173 insertions(+), 12 deletions(-) diff --git a/content/develop/clients/jedis/prob.md b/content/develop/clients/jedis/prob.md index c426875e93..8ebd753960 100644 --- a/content/develop/clients/jedis/prob.md +++ b/content/develop/clients/jedis/prob.md @@ -98,15 +98,50 @@ set's membership with a fixed memory size, regardless of how many items you add. The following example adds some names to a Bloom filter representing a list of users and checks for the presence or absence of users in the list. -{{< clients-example home_prob_dts bloom Java-Sync >}} -{{< /clients-example >}} +```java +List<Boolean> res1 = jedis.bfMAdd( + "recorded_users", + "andy", "cameron", "david", "michelle" +); +System.out.println(res1); // >>> [true, true, true, true] + +boolean res2 = jedis.bfExists("recorded_users", "cameron"); +System.out.println(res2); // >>> true + +boolean res3 = jedis.bfExists("recorded_users", "kaitlyn"); +System.out.println(res3); // >>> false +``` + A Cuckoo filter has similar features to a Bloom filter, but also supports a deletion operation to remove hashes from a set, as shown in the example below. 
-{{< clients-example home_prob_dts cuckoo Java-Sync >}} -{{< /clients-example >}} + +```java +boolean res4 = jedis.cfAdd("other_users", "paolo"); +System.out.println(res4); // >>> true + +boolean res5 = jedis.cfAdd("other_users", "kaitlyn"); +System.out.println(res5); // >>> true + +boolean res6 = jedis.cfAdd("other_users", "rachel"); +System.out.println(res6); // >>> true + +List<Boolean> res7 = jedis.cfMExists( + "other_users", + "paolo", "rachel", "andy" +); +System.out.println(res7); // >>> [true, true, false] + +boolean res8 = jedis.cfDel("other_users", "paolo"); +System.out.println(res8); // >>> true + +boolean res9 = jedis.cfExists("other_users", "paolo"); +System.out.println(res9); // >>> false +``` Which of these two data types you choose depends on your use case. Bloom filters are generally faster than Cuckoo filters when adding new items, @@ -126,8 +161,30 @@ You can also merge two or more HyperLogLogs to find the cardinality of the [union](https://en.wikipedia.org/wiki/Union_(set_theory)) of the sets they represent. -{{< clients-example home_prob_dts hyperloglog Java-Sync >}} -{{< /clients-example >}} + +```java +long res10 = jedis.pfadd("group:1", "andy", "cameron", "david"); +System.out.println(res10); // >>> 1 + +long res11 = jedis.pfcount("group:1"); +System.out.println(res11); // >>> 3 + +long res12 = jedis.pfadd( + "group:2", + "kaitlyn", "michelle", "paolo", "rachel" +); +System.out.println(res12); // >>> 1 + +long res13 = jedis.pfcount("group:2"); +System.out.println(res13); // >>> 4 + +String res14 = jedis.pfmerge("both_groups", "group:1", "group:2"); +System.out.println(res14); // >>> OK + +long res15 = jedis.pfcount("both_groups"); +System.out.println(res15); // >>> 7 +``` The main benefit that HyperLogLogs offer is their very low memory usage. They can count up to 2^64 items with less than @@ -165,8 +222,44 @@ stay within 0.1% of the true value and have a 0.05% probability of going outside this limit. 
The example below shows how to create a Count-min sketch object, add data to it, and then query it. -{{< clients-example home_prob_dts cms Java-Sync >}} -{{< /clients-example >}} + +```java +// Specify that you want to keep the counts within 0.01 +// (0.1%) of the true value with a 0.005 (0.05%) chance +// of going outside this limit. +String res16 = jedis.cmsInitByProb("items_sold", 0.01, 0.005); +System.out.println(res16); // >>> OK + +Map<String, Long> firstItemIncrements = new HashMap<>(); +firstItemIncrements.put("bread", 300L); +firstItemIncrements.put("tea", 200L); +firstItemIncrements.put("coffee", 200L); +firstItemIncrements.put("beer", 100L); + +List<Long> res17 = jedis.cmsIncrBy("items_sold", + firstItemIncrements +); +res17.sort(null); +System.out.println(res17); // >>> [100, 200, 200, 300] + +Map<String, Long> secondItemIncrements = new HashMap<>(); +secondItemIncrements.put("bread", 100L); +secondItemIncrements.put("coffee", 150L); + +List<Long> res18 = jedis.cmsIncrBy("items_sold", + secondItemIncrements +); +res18.sort(null); +System.out.println(res18); // >>> [350, 400] + +List<Long> res19 = jedis.cmsQuery( + "items_sold", + "bread", "tea", "coffee", "beer" +); +res19.sort(null); +System.out.println(res19); // >>> [100, 200, 350, 400] +``` The advantage of using a CMS over keeping an exact count with a [sorted set]({{< relref "/develop/data-types/sorted-sets" >}}) @@ -197,8 +290,48 @@ maximum values, the quantile of 0.75, and the shows how to merge two or more t-digest objects to query the combined data set. 
-{{< clients-example home_prob_dts tdigest Java-Sync >}} -{{< /clients-example >}} + +```java +String res20 = jedis.tdigestCreate("male_heights"); +System.out.println(res20); // >>> OK + +String res21 = jedis.tdigestAdd("male_heights", + 175.5, 181, 160.8, 152, 177, 196, 164); +System.out.println(res21); // >>> OK + +double res22 = jedis.tdigestMin("male_heights"); +System.out.println(res22); // >>> 152.0 + +double res23 = jedis.tdigestMax("male_heights"); +System.out.println(res23); // >>> 196.0 + +List<Double> res24 = jedis.tdigestQuantile("male_heights", 0.75); +System.out.println(res24); // >>> [181.0] + +// Note that the CDF value for 181 is not exactly 0.75. +// Both values are estimates. +List<Double> res25 = jedis.tdigestCDF("male_heights", 181); +System.out.println(res25); // >>> [0.7857142857142857] + +String res26 = jedis.tdigestCreate("female_heights"); +System.out.println(res26); // >>> OK + +String res27 = jedis.tdigestAdd("female_heights", + 155.5, 161, 168.5, 170, 157.5, 163, 171); +System.out.println(res27); // >>> OK + +List<Double> res28 = jedis.tdigestQuantile("female_heights", 0.75); +System.out.println(res28); // >>> [170.0] + +String res29 = jedis.tdigestMerge( + "all_heights", + "male_heights", "female_heights" +); +System.out.println(res29); // >>> OK +List<Double> res30 = jedis.tdigestQuantile("all_heights", 0.75); +System.out.println(res30); // >>> [175.5] +``` A t-digest object also supports several other related commands, such as querying by rank. See the @@ -219,5 +352,33 @@ the `topkReserve()` method). It also shows how to list the top *k* items and query whether or not a given item is in the list. 
-{{< clients-example home_prob_dts topk Java-Sync >}} -{{< /clients-example >}} + +```java +String res31 = jedis.topkReserve("top_3_songs", 3L, 2000L, 7L, 0.925D); +System.out.println(res31); // >>> OK + +Map<String, Long> songIncrements = new HashMap<>(); +songIncrements.put("Starfish Trooper", 3000L); +songIncrements.put("Only one more time", 1850L); +songIncrements.put("Rock me, Handel", 1325L); +songIncrements.put("How will anyone know?", 3890L); +songIncrements.put("Average lover", 4098L); +songIncrements.put("Road to everywhere", 770L); + +List<String> res32 = jedis.topkIncrBy("top_3_songs", + songIncrements +); +System.out.println(res32); +// >>> [null, null, null, null, null, Rock me, Handel] + +List<String> res33 = jedis.topkList("top_3_songs"); +System.out.println(res33); +// >>> [Average lover, How will anyone know?, Starfish Trooper] + +List<Boolean> res34 = jedis.topkQuery("top_3_songs", + "Starfish Trooper", "Road to everywhere" +); +System.out.println(res34); +// >>> [true, false] +``` From aa5a78a92ccd8cc110703c963c14e2659adeac64 Mon Sep 17 00:00:00 2001 From: Andy Stark Date: Tue, 24 Jun 2025 14:55:42 +0100 Subject: [PATCH 3/3] DOC-5227 fixed incorrect percentages --- content/develop/clients/jedis/prob.md | 2 +- content/develop/clients/redis-py/prob.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/content/develop/clients/jedis/prob.md b/content/develop/clients/jedis/prob.md index 8ebd753960..dab5ab417e 100644 --- a/content/develop/clients/jedis/prob.md +++ b/content/develop/clients/jedis/prob.md @@ -226,7 +226,7 @@ a Count-min sketch object, add data to it, and then query it. < /clients-example >}}--> ```java // Specify that you want to keep the counts within 0.01 -// (0.1%) of the true value with a 0.005 (0.05%) chance +// (1%) of the true value with a 0.005 (0.5%) chance // of going outside this limit. 
String res16 = jedis.cmsInitByProb("items_sold", 0.01, 0.005); System.out.println(res16); // >>> OK diff --git a/content/develop/clients/redis-py/prob.md b/content/develop/clients/redis-py/prob.md index 8547f50803..4fd3d92707 100644 --- a/content/develop/clients/redis-py/prob.md +++ b/content/develop/clients/redis-py/prob.md @@ -222,7 +222,7 @@ sketch commands. ```py # Specify that you want to keep the counts within 0.01 -# (0.1%) of the true value with a 0.005 (0.05%) chance +# (1%) of the true value with a 0.005 (0.5%) chance # of going outside this limit. res16 = r.cms().initbyprob("items_sold", 0.01, 0.005) print(res16) # >>> True