Distinct Estimators
===================

This repository contains seven PostgreSQL extensions, each providing
a different statistical estimator useful for estimating the number
of distinct items in a data set.

If you need to track distinct elements and you can settle for a
reasonably precise estimate, this may be an interesting option.
These extensions can give you estimates with about 1% error, with
very little memory (usually just a few kilobytes), and they are much
faster than the usual DISTINCT aggregate.

So if you don't need exact counts (and many applications can work
with estimates without any problem), this may be a great tool.

I wrote a [short article](http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/)
about it a while ago and now I've finally finished these extensions.


Usage as an aggregate
---------------------
There are two ways to use these extensions - either as an aggregate
or as a data type (for a column). Let's look at the aggregates first.

There are seven extensions, each providing an aggregate:

1. hyperloglog_distinct(anyelement)
2. adaptive_distinct(anyelement)
3. bitmap_distinct(anyelement)
4. pcsa_distinct(anyelement)
5. probabilistic_distinct(anyelement)
6. loglog_distinct(anyelement)
7. superloglog_distinct(anyelement)

plus parameterized variants of the same aggregates (mostly to tweak
precision and memory requirements).

If you don't know which of the estimators to use, use hyperloglog - it's
the state-of-the-art estimator, providing precise estimates with very low
memory requirements.
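
For example, the simplest call (the variant with no extra parameters,
using default settings) looks like this:

    db=# SELECT hyperloglog_distinct(i) FROM generate_series(1,1000000);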

My second favourite is the adaptive estimator, but that's mostly
because of how elegant the implementation feels.

The estimators are not completely independent - some of them are
(improved) successors of other estimators. Using 'A > B' notation
to express that A is an improved version of B, the relationships might
be written like this:

* hyperloglog > superloglog > loglog
* pcsa > probabilistic

The 'bitmap' and 'adaptive' estimators are pretty much independent,
although all the estimators are based on probability and the ideas are
often quite similar.

Feel free to experiment with the various estimators, the effects of the
parameters etc.


Aggregate parameters
--------------------
For the adaptive and bitmap estimators, the parameters are the desired
error rate and the expected number of distinct values. So, for example,
if you want an estimate with 1% error and you expect about 1,000,000
distinct items, you can do this:

    db=# SELECT adaptive_distinct(i, 0.01, 1000000)
           FROM generate_series(1,1000000);

and likewise for the bitmap estimator.

In the case of the pcsa/probabilistic estimators, the parameters are
not that straightforward - in most cases you have to play with the
estimator a bit to find out what parameters to use (see the details
in the README for each estimator).

Or you may forget about the parameters and just use the simplified
version of the aggregates without the parameters, i.e.

    db=# SELECT adaptive_distinct(i) FROM generate_series(1,1000000);

which uses carefully selected default parameter values that work quite
well in most cases. But if you want the best performance (low memory
requirements, fast execution and precise results), you'll have to play
a bit with the estimators.
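
To get a feel for the accuracy, it's easy to compare an estimate with
the exact count on a data set where you know the answer:

    db=# SELECT adaptive_distinct(i) AS estimate,
                count(DISTINCT i)    AS exact_count
           FROM generate_series(1,1000000) s(i);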


Usage as a data type (for a column)
-----------------------------------
Each of the estimators provides a separate data type (based on bytea),
and you may use it as a column in a table or in a PL/pgSQL procedure.
For example, to create a table with an adaptive_estimator column, you
can do this:

    CREATE TABLE article (
        ...
        cnt_visitors adaptive_estimator DEFAULT adaptive_init(0.01, 1e6)
        ...
    );

What is it good for? Well, you can continuously update the column and
know how many distinct visitors (by IP or whatever you choose) have
already read the article. You can't do that with a plain INT column.

The counter may be easily updated by a trigger - just use the proper
add_item function to add items and get_estimate to get the current
estimate. There are a few other functions, e.g. "reset" (which
effectively sets the counter to 0) and "size" (which returns the
memory requirements of an estimator with the supplied parameters).
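
A minimal sketch of such a trigger - note that the table layout and
the prefixed function names (adaptive_add_item, adaptive_get_estimate)
are assumptions here, so check the README of the estimator you use for
the actual signatures:

    -- assumed schema: article(id, cnt_visitors), visit(article_id, ip)
    CREATE OR REPLACE FUNCTION count_visit() RETURNS trigger AS $$
    BEGIN
        -- add the visitor's IP to the per-article estimator
        -- (adaptive_add_item is an assumed name, see the adaptive README)
        UPDATE article
           SET cnt_visitors = adaptive_add_item(cnt_visitors, NEW.ip)
         WHERE id = NEW.article_id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER count_visit AFTER INSERT ON visit
        FOR EACH ROW EXECUTE PROCEDURE count_visit();

    -- reading the current estimate for one article
    SELECT adaptive_get_estimate(cnt_visitors) FROM article WHERE id = 1;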

Anyway, be careful about the implementation, as the estimators may
easily occupy several kilobytes (depending on the precision etc.).
Keep in mind that PostgreSQL MVCC creates a copy of the row on each
update, and that may easily lead to table bloat. So group the updates
or something like that.


Differences
-----------
It's difficult to give a clear rule on which of the extensions to
choose. I recommend playing with them a bit - try them on an existing
data set and you'll see which one performs best.

Anyway, the general rules are:

1. The higher the precision and expected number of distinct values,
   the more memory / CPU time you'll need.

2. HyperLogLog is pretty much the state-of-the-art estimator. There
   are probably cases when one of the others performs better, but
   by default you should probably use HLL.

3. Adaptive/bitmap estimators are generally easier to work with, as
   you can set the error rate / expected number of distinct values
   directly.

4. The pcsa/probabilistic estimators are much simpler, but it's not
   clear what precision you'll get from particular parameter values,
   so you need to play with them a bit to get an idea of the right
   values.

5. Pcsa/probabilistic estimators usually require less memory than
   the adaptive/bitmap ones.

6. The adaptive estimator has a very interesting feature - you can
   merge multiple estimators into a single one, providing a global
   distinct estimate. For example, you may maintain counters for
   each article category and later merge the estimators to get
   an estimate for a group of categories (see the sketch after
   this list).

7. The pcsa estimator tends to give bad results for low numbers of
   distinct values - that's where adaptive/bitmap clearly win.
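
For item 6, a sketch of what merging might look like - the merge
function name (adaptive_merge) is an assumption here, as is the
category table, so check the adaptive estimator's README for the
actual name and signature:

    -- hypothetical: merge two per-category counters into one estimate
    SELECT adaptive_get_estimate(
               adaptive_merge(a.cnt_visitors, b.cnt_visitors)
           ) AS combined_estimate
      FROM category a, category b
     WHERE a.name = 'databases' AND b.name = 'programming';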

See the READMEs for the individual estimators for more details.