
Switched to markdown for READMEs, added them as 'docfile' to META.json.

1 parent 63b703f commit ea32c8d3ad8a7f8d770f2e0f9047d88dc12fa2c1 @tvondra committed Apr 4, 2012
Showing with 376 additions and 374 deletions.
  1. +31 −33 README → README.md
  2. +1 −0 adaptive/META.json
  3. +0 −88 adaptive/README
  4. +88 −0 adaptive/README.md
  5. +1 −0 bitmap/META.json
  6. +0 −87 bitmap/README
  7. +87 −0 bitmap/README.md
  8. +1 −0 pcsa/META.json
  9. +0 −83 pcsa/README
  10. +83 −0 pcsa/README.md
  11. +1 −0 probabilistic/META.json
  12. +0 −83 probabilistic/README
  13. +83 −0 probabilistic/README.md
README → README.md
@@ -1,5 +1,6 @@
Distinct Estimators
===================
+
This repository contains four PostgreSQL extensions, each with
a different statistical estimator, useful for estimating the number
of distinct items in a data set.
@@ -10,11 +11,8 @@ These extensions may easily give you estimates with about 1% error,
use very little memory (usually just a few kilobytes), and run much
faster than the usual DISTINCT aggregate.
-I wrote a short article about it a while ago
-
- http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/
-
-and now I've finally finished these extensions.
+I wrote a [short article](http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/)
+about it a while ago and now I've finally finished these extensions.
Usage as an aggregate
@@ -24,10 +22,10 @@ or as a data type (for a column). Let's see the aggregate first ...
There are four extensions, each one providing an aggregate
- 1) adaptive_distinct(int, real, int)
- 2) bitmap_distinct(int, real, int)
- 3) pcsa_distinct(int, int, int)
- 4) probabilistic_distinct(int, int, int)
+1. adaptive_distinct(int, real, int)
+2. bitmap_distinct(int, real, int)
+3. pcsa_distinct(int, int, int)
+4. probabilistic_distinct(int, int, int)
and about the same aggregates for text values. The second and third
arguments are parameters of the estimator, usually affecting precision
@@ -38,7 +36,7 @@ rate and expected number of distinct values, so for example to get an
estimate with 1% error when you expect there to be 1.000.000 distinct
items, you can do this
- db=# SELECT adaptive_distinct(i, 0.01, 1000000)
+ db=# SELECT adaptive_distinct(i, 0.01, 1000000)
FROM generate_series(1,1000000);
and likewise for the bitmap estimator.
@@ -51,7 +49,7 @@ in the readme for the estimators).
Or you may forget about the parameters and use just a simplified
version of the aggregates without the parameters, i.e.
- db=# SELECT adaptive_distinct(i) FROM generate_series(1,1000000);
+ db=# SELECT adaptive_distinct(i) FROM generate_series(1,1000000);
which uses some carefully selected parameter values that work quite
well in most cases. But if you want to get the best performance (low
@@ -66,11 +64,11 @@ and you may use it as a column in a table or in a PL/pgSQL procedure.
For example, to create a table with an adaptive_estimator column, you
can do this:
-CREATE TABLE article (
- ...
- cnt_visitors adaptive_estimator DEFAULT adaptive_init(0.01, 1e6)
- ...
-);
+ CREATE TABLE article (
+ ...
+ cnt_visitors adaptive_estimator DEFAULT adaptive_init(0.01, 1e6)
+ ...
+ );
What is it good for? Well, you can continuously update the column and
know how many distinct visitors (by IP or whatever you choose) already
@@ -97,27 +95,27 @@ and you'll see which one performs best.
Anyway, my experience is that:
- a) The higher the precision and expected number of distinct values,
- the more memory / CPU time you'll need.
+1. The higher the precision and expected number of distinct values,
+ the more memory / CPU time you'll need.
- b) Adaptive/bitmap estimators are generally easier to work with, as
- you can easily set the error rate / expected number of distinct
- values.
+2. Adaptive/bitmap estimators are generally easier to work with, as
+ you can easily set the error rate / expected number of distinct
+ values.
- c) The pcsa/probabilistic estimators are much simpler and you need
- to play with them to get an idea what the right parameter values
- are.
+3. The pcsa/probabilistic estimators are much simpler and you need
+ to play with them to get an idea what the right parameter values
+ are.
- d) Pcsa/probabilistic estimators usually require less memory than
+4. Pcsa/probabilistic estimators usually require less memory than
the adaptive/bitmap.
- e) Adaptive estimator has a very interesting feature - you can
- merge multiple estimators into a single one, providing a global
- distinct estimate. For example you may maintain counters for
- each article category, and later merge the estimators to get
- an estimate for a group of categories.
+5. The adaptive estimator has a very interesting feature - you can
+ merge multiple estimators into a single one, providing a global
+ distinct estimate. For example you may maintain counters for
+ each article category, and later merge the estimators to get
+ an estimate for a group of categories.
- f) The pcsa tend to give bad results for low numbers of distinct
- values - that's where adaptive/bitmap clearly win.
+6. The pcsa estimator tends to give bad results for low numbers of
+   distinct values - that's where adaptive/bitmap clearly win.
-See README for individual estimators for more details.
+See READMEs for individual estimators for more details.
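The merge feature from point 5 can be sketched using the `adaptive_merge` and `adaptive_get_estimate` functions listed in the adaptive extension's README. The `category_stats` table and its columns below are hypothetical, following the pattern of the `article` example above:

```sql
-- Hypothetical schema: one estimator per article category, as in the
-- CREATE TABLE example above (table/column names are illustrative).
-- Merge two per-category estimators and read a combined estimate.
SELECT adaptive_get_estimate(
           adaptive_merge(a.cnt_visitors, b.cnt_visitors)
       ) AS distinct_visitors
  FROM category_stats a, category_stats b
 WHERE a.category = 'databases'
   AND b.category = 'postgresql';
```

Note that only the adaptive extension lists a merge function; the bitmap, pcsa and probabilistic function lists contain no equivalent.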
adaptive/META.json
@@ -15,6 +15,7 @@
"provides": {
"adaptive_estimator": {
"file": "adaptive_counter--1.0.sql",
+ "docfile" : "README.md",
"version": "1.0.0"
}
},
adaptive/README
@@ -1,88 +0,0 @@
-Adaptive Distinct Estimator
-===========================
-
-This is an implementation of Adaptive Sampling algorithm presented in
-paper "On Adaptive Sampling" pub. in 1990 (written by P. Flajolet).
-
-Contents of the extension
--------------------------
-The extension provides the following elements
-
- a) adaptive_estimator data type (may be used for columns, in PL/pgSQL)
-
- b) functions to work with the adaptive_estimator data type
-
- - adaptive_size(real error, item_size int)
- - adaptive_init(real error, item_size int)
-
- - adaptive_add_item(adaptive_estimator counter, item text)
- - adaptive_add_item(adaptive_estimator counter, item int)
-
- - adaptive_get_estimate(adaptive_estimator counter)
- - adaptive_get_error(adaptive_estimator counter)
- - adaptive_get_ndistinct(adaptive_estimator counter)
- - adaptive_get_item_size(adaptive_estimator counter)
-
- - adaptive_reset(adaptive_estimator counter)
- - adaptive_merge(adaptive_estimator c1, adaptive_estimator c2)
-
- - length(adaptive_estimator counter)
-
- The purpose of the functions is quite obvious from the names,
- alternatively consult the SQL script for more details.
-
- c) aggregate functions
-
- - adaptive_distinct(text, real, int)
- - adaptive_distinct(text)
-
- - adaptive_distinct(int, real, int)
- - adaptive_distinct(int)
-
- where the 1-parameter version uses 0.025 (2.5%) and 1.000.000
- as default values for the two parameters. That's quite generous
- and it may result in unnecessarily large estimators, so if you
- can work with lower precision / expect less distinct values,
- pass the parameters explicitly.
-
-
-Usage
------
-Using the aggregate is quite straightforward - just use it like a
-regular aggregate function
-
- db=# SELECT adaptive_distinct(i, 0.01, 100000)
- FROM generate_series(1,100000) s(i);
-
-and you can use it from a PL/pgSQL (or another PL) like this:
-
- DO LANGUAGE plpgsql $$
- DECLARE
- v_counter adaptive_estimator := adaptive_init(0.01,10000);
- v_estimate real;
- BEGIN
- PERFORM adaptive_add_item(v_counter, 1);
- PERFORM adaptive_add_item(v_counter, 2);
- PERFORM adaptive_add_item(v_counter, 3);
-
- SELECT adaptive_get_estimate(v_counter) INTO v_estimate;
-
- RAISE NOTICE 'estimate = %',v_estimate;
- END$$;
-
-And this can be done from a trigger (updating an estimate stored
-in a table).
-
-
-Problems
---------
-Be careful about the implementation, as the estimators may easily
-occupy several kilobytes (depends on the precision etc.). Keep in
-mind that the PostgreSQL MVCC works so that it creates a copy of
-the row on update, an that may easily lead to bloat. So group the
-updates or something like that.
-
-This is of course made worse by using unnecessarily large estimators,
-so always tune the estimator to use the lowest acceptable precision
-and lowest expected number of distinct elements (because that's what
-increases the estimator size).
adaptive/README.md
@@ -0,0 +1,88 @@
+Adaptive Distinct Estimator
+===========================
+
+This is an implementation of the Adaptive Sampling algorithm presented
+in the paper "On Adaptive Sampling" by P. Flajolet, published in 1990.
+
+Contents of the extension
+-------------------------
+The extension provides the following elements:
+
+* adaptive_estimator data type (may be used for columns, in PL/pgSQL)
+
+* functions to work with the adaptive_estimator data type
+
+ * `adaptive_size(error real, item_size int)`
+ * `adaptive_init(error real, item_size int)`
+
+ * `adaptive_add_item(adaptive_estimator counter, item text)`
+ * `adaptive_add_item(adaptive_estimator counter, item int)`
+
+ * `adaptive_get_estimate(adaptive_estimator counter)`
+ * `adaptive_get_error(adaptive_estimator counter)`
+ * `adaptive_get_ndistinct(adaptive_estimator counter)`
+ * `adaptive_get_item_size(adaptive_estimator counter)`
+
+ * `adaptive_reset(adaptive_estimator counter)`
+
+ * `adaptive_merge(adaptive_estimator c1, adaptive_estimator c2)`
+
+ * `length(adaptive_estimator counter)`
+
+ The purpose of the functions is quite obvious from the names;
+ alternatively, consult the SQL script for more details.
+
+* aggregate functions
+
+ * `adaptive_distinct(text, real, int)`
+ * `adaptive_distinct(text)`
+ * `adaptive_distinct(int, real, int)`
+ * `adaptive_distinct(int)`
+
+ where the 1-parameter version uses 0.025 (2.5%) and 1.000.000
+ as default values for the two parameters. That's quite generous
+ and it may result in unnecessarily large estimators, so if you
+ can work with lower precision / expect fewer distinct values,
+ pass the parameters explicitly.
+
+
+Usage
+-----
+Using the aggregate is quite straightforward - just use it like a
+regular aggregate function
+
+ db=# SELECT adaptive_distinct(i, 0.01, 100000)
+ FROM generate_series(1,100000) s(i);
+
+and you can use it from a PL/pgSQL (or another PL) like this:
+
+ DO LANGUAGE plpgsql $$
+ DECLARE
+ v_counter adaptive_estimator := adaptive_init(0.01,10000);
+ v_estimate real;
+ BEGIN
+ PERFORM adaptive_add_item(v_counter, 1);
+ PERFORM adaptive_add_item(v_counter, 2);
+ PERFORM adaptive_add_item(v_counter, 3);
+
+ SELECT adaptive_get_estimate(v_counter) INTO v_estimate;
+
+ RAISE NOTICE 'estimate = %',v_estimate;
+ END$$;
+
+And this can be done from a trigger (updating an estimate stored
+in a table).
+
+
+Problems
+--------
+Be careful about the implementation, as the estimators may easily
+occupy several kilobytes (depending on the precision etc.). Keep in
+mind that PostgreSQL's MVCC creates a copy of the row on every update,
+and that may easily lead to bloat. So group the updates if you can.
+
+This is of course made worse by using unnecessarily large estimators,
+so always tune the estimator to use the lowest acceptable precision
+and lowest expected number of distinct elements (because that's what
+increases the estimator size).
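The trigger mentioned above ("updating an estimate stored in a table") might be sketched as follows, reusing the pattern from the DO block. The `visit` and `article_stats` tables and all column names are hypothetical, not part of the extension:

```sql
-- Hypothetical schema: count distinct visitor IPs per article.
CREATE TABLE article_stats (
    article_id int PRIMARY KEY,
    visitors   adaptive_estimator DEFAULT adaptive_init(0.01, 1e6)
);

CREATE TABLE visit (
    article_id int,
    visitor_ip text
);

CREATE FUNCTION count_visit() RETURNS trigger AS $$
DECLARE
    v_counter adaptive_estimator;
BEGIN
    SELECT visitors INTO v_counter
      FROM article_stats
     WHERE article_id = NEW.article_id;

    -- same pattern as the DO block above: the item is folded into
    -- the counter held in the local variable
    PERFORM adaptive_add_item(v_counter, NEW.visitor_ip);

    UPDATE article_stats
       SET visitors = v_counter
     WHERE article_id = NEW.article_id;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER count_visit AFTER INSERT ON visit
    FOR EACH ROW EXECUTE PROCEDURE count_visit();
```

Per the Problems section, each such UPDATE copies the whole row, so a per-row trigger like this is the worst case for bloat; batching visits before touching the estimator row is kinder to the table.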
bitmap/META.json
@@ -15,6 +15,7 @@
"provides": {
"pcsa_estimator": {
"file": "bitmap_estimator--1.0.sql",
+ "docfile" : "README.md",
"version": "1.0.0"
}
},
bitmap/README
@@ -1,87 +0,0 @@
-Bitmap Distinct Estimator
-=========================
-
-This is an implementation of self-learning bitmap, as described in the
-paper "Distinct Counting with a Self-Learning Bitmap" (by Aiyou Chen
-and Jin Cao, published in 2009).
-
-Contents of the extension
--------------------------
-The extension provides the following elements
-
- a) bitmap_estimator data type (may be used for columns, in PL/pgSQL)
-
- b) functions to work with the bitmap_estimator data type
-
- - bitmap_size(real error, item_size int)
- - bitmap_init(real error, item_size int)
-
- - bitmap_add_item(bitmap_estimator counter, item text)
- - bitmap_add_item(bitmap_estimator counter, item int)
-
- - bitmap_get_estimate(bitmap_estimator counter)
- - bitmap_get_error(bitmap_estimator counter)
- - bitmap_get_ndistinct(bitmap_estimator counter)
-
- - bitmap_reset(bitmap_estimator counter)
-
- - length(bitmap_estimator counter)
-
- The purpose of the functions is quite obvious from the names,
- alternatively consult the SQL script for more details.
-
- c) aggregate functions
-
- - bitmap_distinct(text, real, int)
- - bitmap_distinct(text)
-
- - bitmap_distinct(int, real, int)
- - bitmap_distinct(int)
-
- where the 1-parameter version uses 0.025 (2.5%) and 1.000.000
- as default values for the two parameters. That's quite generous
- and it may result in unnecessarily large estimators, so if you
- can work with lower precision / expect less distinct values,
- pass the parameters explicitly.
-
-
-Usage
------
-Using the aggregate is quite straightforward - just use it like a
-regular aggregate function
-
- db=# SELECT bitmap_distinct(i, 0.01, 100000)
- FROM generate_series(1,100000) s(i);
-
-and you can use it from a PL/pgSQL (or another PL) like this:
-
- DO LANGUAGE plpgsql $$
- DECLARE
- v_counter bitmap_estimator := bitmap_init(0.01,10000);
- v_estimate real;
- BEGIN
- PERFORM bitmap_add_item(v_counter, 1);
- PERFORM bitmap_add_item(v_counter, 2);
- PERFORM bitmap_add_item(v_counter, 3);
-
- SELECT bitmap_get_estimate(v_counter) INTO v_estimate;
-
- RAISE NOTICE 'estimate = %',v_estimate;
- END$$;
-
-And this can be done from a trigger (updating an estimate stored
-in a table).
-
-
-Problems
---------
-Be careful about the implementation, as the estimators may easily
-occupy several kilobytes (depends on the precision etc.). Keep in
-mind that the PostgreSQL MVCC works so that it creates a copy of
-the row on update, an that may easily lead to bloat. So group the
-updates or something like that.
-
-This is of course made worse by using unnecessarily large estimators,
-so always tune the estimator to use the lowest acceptable precision
-and lowest expected number of distinct elements (because that's what
-increases the estimator size).
