
Statistics
==========

Events
------

The usage statistics of Zenodo records are generated from two different
types of events:

* ``record-view``: this event is related to the view of a record and is bound
  to the ``record_viewed`` signal, which is emitted when one of the following
  endpoints is accessed:

* ``/record/<recid>``
* ``/record/<recid>/export/<format>``

* ``file-download``: this event is related to the download of a file from a
  record and is bound to the ``file_downloaded`` signal, which is emitted when
  one of the following endpoints is accessed:

* ``/record/<recid>/files/<filename>``
* ``/api/record/<recid>/files/<filename>``

Every time a user views a record or downloads a file, the corresponding signal
is emitted and the corresponding event is created. If an event fails to be
emitted, due to an exception raised in an event builder or a failure to send
the message to RabbitMQ, the request is not affected and a warning message is
logged.
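
The flow can be pictured with the following sketch. It is illustrative only:
the handler, the ``build_record_view_event`` builder and the ``publish_event``
helper are hypothetical names, not Zenodo's actual code.

.. code-block:: python

    from flask import current_app, request

    def on_record_viewed(sender, pid=None, record=None, **kwargs):
        """Hypothetical handler connected to the ``record_viewed`` signal."""
        try:
            # Build the event payload from the request context (event builder).
            event = build_record_view_event(pid, record, request)
            # Send the event to the dedicated RabbitMQ queue.
            publish_event('record-view', event)
        except Exception:
            # A failure must never affect the user's request: log and move on.
            current_app.logger.warning(
                'Could not emit record-view event', exc_info=True)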

At creation time, each event keeps track of the following fields:

* ``record_id``
* ``recid``
* ``conceptrecid``
* ``doi``
* ``conceptdoi``
* ``access_right``
* ``resource_type``
* ``communities``
* ``owners``
* ``timestamp``
* ``referrer``

We also capture the following user-related fields for a limited period of
time, i.e. until the event is processed and the data are anonymized:

* ``ip_address``
* ``user_agent``
* ``user_id``
* ``session_id``

There are also some event-specific fields; an example payload combining them
is sketched after the lists below. For the ``file-download`` event we track:

* ``bucket_id``
* ``file_id``
* ``file_key``
* ``size``

For the ``record-view`` event we track:

* ``pid_type``
* ``pid_value``
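
For illustration, a freshly created ``file-download`` event combining the
fields above could look like the following (all values are made up):

.. code-block:: python

    {
        "timestamp": "2018-01-01T10:12:05",
        "recid": "1000",
        "conceptrecid": "999",
        "record_id": "5bad6b11-84ed-4946-86a9-2b614a63d2b4",
        "doi": "10.5281/zenodo.1000",
        "conceptdoi": "10.5281/zenodo.999",
        "access_right": "open",
        "resource_type": {"type": "dataset"},
        "communities": ["biosyslit"],
        "owners": [42],
        "referrer": "https://zenodo.org/record/1000",
        # file-download specific fields:
        "bucket_id": "...",
        "file_id": "...",
        "file_key": "dataset.zip",
        "size": 1024,
        # user-related fields, kept only until the event is processed:
        "ip_address": "192.0.2.1",
        "user_agent": "Mozilla/5.0 ...",
        "user_id": 42,
        "session_id": "...",
    }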

After an event is created, it is sent to a dedicated RabbitMQ queue. Each type
of event has its own queue, so different types of events can be processed
separately. The events in the queue are then consumed and processed by a
Celery task. In this step the flags ``is_machine`` and ``is_robot`` are added,
the user is anonymized and double-clicks are removed. During the data
anonymization process the user-related fields are deleted and replaced by the
``visitor_id`` and the ``unique_session_id``. If an event fails to be
processed (e.g. due to a malformed IP address), the event is skipped/lost and
a warning message is logged. After the processing, the robot events are
discarded to save space, since they are not relevant for the statistics. All
of the other processed events are saved in Elasticsearch.
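
As a sketch of what the anonymization step could look like (the exact hashing
scheme shown here is an assumption, not Zenodo's actual implementation):

.. code-block:: python

    import hashlib

    def anonymize_event(event, salt):
        """Replace user-related fields with anonymous identifiers (sketch)."""
        # Derive a stable per-visitor id from whatever identifies the user.
        visitor = '{0}|{1}|{2}'.format(
            event.get('user_id') or event.get('ip_address'),
            event.get('user_agent'), salt)
        event['visitor_id'] = hashlib.sha224(visitor.encode()).hexdigest()
        # Sessions are bucketed per hour: views/downloads by the same user
        # within the same one-hour window share one unique_session_id.
        hour = event['timestamp'][:13]  # e.g. "2018-01-01T10"
        event['unique_session_id'] = hashlib.sha224(
            '{0}|{1}'.format(visitor, hour).encode()).hexdigest()
        # Drop the personal data now that the anonymous ids are derived.
        for field in ('ip_address', 'user_agent', 'user_id', 'session_id'):
            event.pop(field, None)
        return event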

Aggregations
------------

All the events generated by record views or file downloads are aggregated in
several ways to produce daily statistics. These are the different types of
aggregations used by Zenodo:

* ``record-view-agg``: this aggregation is applied to the ``record-view``
events and it calculates the daily views and unique views of a specific
version of a record;
* ``record-view-all-versions-agg``: this aggregation is applied to the
``record-view`` events and it calculates the daily views and unique views of
all versions of a record;
* ``record-download-agg``: this aggregation is applied to the ``file-download``
events and it calculates the daily downloads and unique downloads of a
specific version of a record;
* ``record-download-all-versions-agg``: this aggregation is applied to the
``file-download`` events and it calculates the daily downloads and unique
downloads of all versions of a record.

Both the ``record-view-agg`` and the ``record-view-all-versions-agg`` are
applied to the same ``record-view`` events, and the aggregate documents they
produce are stored in the same indices/alias (``stats-record-view``). The
difference between the two is that the first aggregates the events by
``recid``, while the second aggregates them by ``conceptrecid``. This leads to
two different results: in the first case we get the statistics for a single
version of a record, in the second the statistics for all the versions of a
record.

For example, let's say that we have the following ``record-view`` events:

.. code-block:: python

    {
        "timestamp": "2018-01-01T10:00:00",
        "recid": "1000",
        "conceptrecid": "999",
        ...
    }
    {
        "timestamp": "2018-01-01T11:00:00",
        "recid": "1000",
        "conceptrecid": "999",
        ...
    }
    {
        "timestamp": "2018-01-01T12:00:00",
        "recid": "1001",
        "conceptrecid": "999",
        ...
    }

The result of the ``record-view-agg`` will be two documents, one for each
version of the record:

.. code-block:: python

    {
        "timestamp": "2018-01-01T00:00:00",
        "recid": "1000",
        "count": 2,
        "unique_count": 2,
        ...
    }
    {
        "timestamp": "2018-01-01T00:00:00",
        "recid": "1001",
        "count": 1,
        "unique_count": 1,
        ...
    }

The result of the ``record-view-all-versions-agg`` will be one document which
summarizes the statistics of both versions of the record:

.. code-block:: python

    {
        "timestamp": "2018-01-01T00:00:00",
        "conceptrecid": "999",
        "count": 3,
        "unique_count": 3,
        ...
    }



The same happens for the ``record-download-agg`` and the
``record-download-all-versions-agg``, which are applied to the
``file-download`` events and end up in the ``stats-file-download``
indices/alias.

To count the total number of unique views (and unique downloads) of a record,
each one-hour user session must be identified with a unique id, the
``unique_session_id``. All the views (and all the downloads) made by the same
user within the same one-hour session share the same ``unique_session_id``.
This way, the total number of unique views (or unique downloads) of a record
is simply the cardinality of the ``unique_session_id`` values present in the
events related to the record.
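
As a sketch (an assumption about the query's shape, not a copy of the
production code), the daily per-record counts could be computed with an
Elasticsearch aggregation like the following, where the unique count is the
cardinality of ``unique_session_id``:

.. code-block:: python

    daily_views_query = {
        "query": {"range": {
            "timestamp": {"gte": "2018-01-01", "lt": "2018-01-02"}}},
        "aggs": {
            "per_record": {
                # One bucket per record version (use "conceptrecid" instead
                # to aggregate over all versions of a record).
                "terms": {"field": "recid"},
                "aggs": {
                    "unique_count": {
                        "cardinality": {"field": "unique_session_id"}}
                }
            }
        }
    }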

All the new aggregations are registered via the ``register_aggregations``
method. The aggregation task runs every hour and takes the events from
Elasticsearch.
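
A rough sketch of what such a registration returns is shown below; the keys
and values are illustrative assumptions based on invenio-stats, not a copy of
Zenodo's code:

.. code-block:: python

    from invenio_stats.aggregations import StatAggregator

    def register_aggregations():
        # Illustrative shape only: one entry per aggregation type.
        return [
            {
                'aggregation_name': 'record-view-agg',
                'templates': 'zenodo.modules.stats.templates.aggregations',
                'aggregator_class': StatAggregator,
                'aggregator_config': {
                    'event': 'record-view',
                    'aggregation_field': 'recid',
                    'aggregation_interval': 'day',
                },
            },
            # ... one similar entry for each of the other aggregations.
        ]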

Queries
-------

Metrics for each ``recid`` and ``conceptrecid`` are aggregated and stored in
"daily" documents. For example, a record with ``recid: 12345`` will have
documents like:

.. code-block:: json

    [
      {
        "_id": "12345-2018-01-01",
        "_index": "stats-record-view-2018-01",
        "_source": {
          "timestamp": "2018-01-01T00:00:00",
          "recid": "12345",
          "record_id": "5bad6b11-84ed-4946-86a9-2b614a63d2b4",
          "communities": ["biosyslit"],
          "count": 20,
          "unique_count": 15
        }
      },
      {
        "_id": "12345-2018-01-02",
        "_index": "stats-record-view-2018-01",
        "_source": {
          "timestamp": "2018-01-02T00:00:00",
          "recid": "12345",
          "record_id": "5bad6b11-84ed-4946-86a9-2b614a63d2b4",
          "communities": ["biosyslit"],
          "count": 40,
          "unique_count": 30
        }
      }
    ]

Although that representation is useful for displaying a histogram, it is
obviously not very convenient for generating yearly or all-time statistics for
a record. Invenio-Stats solves this by allowing preconfigured queries to be
performed against Elasticsearch, which further aggregate metrics over periods
of time by filtering and performing `Metrics Aggregations
<https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-metrics.html>`_.

The configured queries that are defined in
``zenodo.modules.stats.registrations`` are:

- ``record-view``: View statistics for specific record versions.
- ``record-view-all-versions``: View statistics for all versions of a record.
- ``record-download``: Download statistics for specific record versions.
- ``record-download-all-versions``: Download statistics for all versions of a
record.

These queries are exposed via a REST API at ``/api/stats``, accessible only to
users with the ``admin-access`` permission.
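
A hedged sketch of how such a query could be issued over HTTP (the request
body shape follows invenio-stats' stats API; treat the exact keys and the
token placeholder as assumptions):

.. code-block:: python

    import requests

    # "downloads" is an arbitrary label for this query in the request body.
    response = requests.post(
        'https://zenodo.org/api/stats',
        headers={'Authorization': 'Bearer <ACCESS-TOKEN>'},
        json={'downloads': {
            'stat': 'record-download-all-versions',
            'params': {'conceptrecid': '999'},
        }},
    )
    print(response.json())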

Records integration
-------------------

While using Queries is enough to fetch individual record statistics, this is
not an optimal solution for the most common use-cases. Making an Elasticsearch
query every time we want to display the total views, downloads, etc. of a
record and all of its versions puts a lot of strain on Elasticsearch.

Another use-case is sorting records by views in search results. Since there is
no way to perform an SQL-like ``JOIN`` in Elasticsearch that would let us join
the ``records`` index with some aggregation of the ``stats-record-view``
indices (and even if there were, it would not be very efficient), there is
only one solution left: to include the statistics inside the record's indexed
document.

Because of the above use-cases, we introduced a ``_stats`` field in the
``records`` Elasticsearch mapping. Every time a record is indexed (either
through normal or bulk indexing), this field is built by performing the
necessary *sub-queries* to Elasticsearch, in order to fetch the all-time
statistics of the record (a sketch of the field's shape follows the list
below). These are:

- ``views`` & ``unique_views``
- ``downloads`` & ``unique_downloads``
- ``volume`` & ``version_volume``
- ``version_views`` & ``version_unique_views``
- ``version_downloads`` & ``version_unique_downloads``

Now that this pre-calculated information is part of the ``record`` index, we
can use it in the following places:

- For sorting search results (e.g. ``sort: '_stats.version_views'``)
- On the record's page, i.e. in the statistics box in the sidebar
- In the record's REST API responses and other serialization formats

.. note::

This means that rendering a record's page or serializing a single record
now also depends on having both the database and Elasticsearch up and
running to get a complete representation. Since statistics are obviously
not as critical as the actual record's metadata, failure to fetch a record
from Elasticsearch will not raise an exception.

Now that we know how to make the statistics of a record available, we have one
final problem to solve: we need to keep the statistics updated! Although
records are indexed from time to time because of user- or system-initiated
editing/publishing, there has to be a regular updating mechanism that indexes
records that might not have been "touched", but only "viewed" or "downloaded".
The ``zenodo.modules.stats.tasks.update_record_statistics`` Celery task is
responsible for this job. It determines which records' statistics have been
affected by Aggregations by checking the last two *bookmarks* created by each
aggregation. Since these bookmarks have daily granularity, at most 1-2 days'
worth of affected records are sent for bulk indexing each time the task runs.
The task is kicked off multiple times a day by Celery Beat.
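
As a sketch (an assumption, since the actual schedule lives in Zenodo's
configuration), such a task could be scheduled every six hours like this:

.. code-block:: python

    from celery.schedules import crontab

    CELERYBEAT_SCHEDULE = {
        'update-record-statistics': {
            'task': 'zenodo.modules.stats.tasks.update_record_statistics',
            # Run at minute 0 of every sixth hour (00:00, 06:00, 12:00, 18:00).
            'schedule': crontab(minute=0, hour='*/6'),
        },
    }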
