
Statistics
==========

Events
------

The usage statistics of Zenodo records are generated from two different
types of events:

* ``record-view``: this event is related to the view of a record and is bound
  to the ``record_viewed`` signal, which is emitted when one of the following
  endpoints is accessed:

* ``/record/<recid>``
* ``/record/<recid>/export/<format>``

* ``file-download``: this event is related to the download of a file from a
  record and is bound to the ``file_downloaded`` signal, which is emitted when
  one of the following endpoints is accessed:

* ``/record/<recid>/files/<filename>``
* ``/api/record/<recid>/files/<filename>``

Every time a user views a record or downloads a file, the corresponding signal
is emitted and the corresponding event is created. If an event fails to be
emitted, due to an exception raised in an event builder or a failure to send
the message to RabbitMQ, the request is not affected and a warning message is
logged.
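
The flow can be pictured with the following sketch. It is illustrative only:
the handler, the ``build_record_view_event`` builder and the ``publish_event``
helper are hypothetical names, not Zenodo's actual code.

.. code-block:: python

    from flask import current_app, request

    def on_record_viewed(sender, pid=None, record=None, **kwargs):
        """Hypothetical handler connected to the ``record_viewed`` signal."""
        try:
            # Build the event payload from the request context (event builder).
            event = build_record_view_event(pid, record, request)
            # Send the event to the dedicated RabbitMQ queue.
            publish_event('record-view', event)
        except Exception:
            # A failure must never affect the user's request: log and move on.
            current_app.logger.warning(
                'Could not emit record-view event', exc_info=True)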

At creation time, each event keeps track of the following fields:

* ``record_id``
* ``recid``
* ``conceptrecid``
* ``doi``
* ``conceptdoi``
* ``access_right``
* ``resource_type``
* ``communities``
* ``owners``
* ``timestamp``
* ``referrer``

We also capture the following user-related fields for a limited period of
time, i.e. until the event is processed and the data are anonymized:

* ``ip_address``
* ``user_agent``
* ``user_id``
* ``session_id``

There are also some event-specific fields; an example payload combining them
is sketched after the lists below. For the ``file-download`` event we track:

* ``bucket_id``
* ``file_id``
* ``file_key``
* ``size``

For the ``record-view`` event we track:

* ``pid_type``
* ``pid_value``
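
For illustration, a freshly created ``file-download`` event combining the
fields above could look like the following (all values are made up):

.. code-block:: python

    {
        "timestamp": "2018-01-01T10:12:05",
        "recid": "1000",
        "conceptrecid": "999",
        "record_id": "5bad6b11-84ed-4946-86a9-2b614a63d2b4",
        "doi": "10.5281/zenodo.1000",
        "conceptdoi": "10.5281/zenodo.999",
        "access_right": "open",
        "resource_type": {"type": "dataset"},
        "communities": ["biosyslit"],
        "owners": [42],
        "referrer": "https://zenodo.org/record/1000",
        # file-download specific fields:
        "bucket_id": "...",
        "file_id": "...",
        "file_key": "dataset.zip",
        "size": 1024,
        # user-related fields, kept only until the event is processed:
        "ip_address": "192.0.2.1",
        "user_agent": "Mozilla/5.0 ...",
        "user_id": 42,
        "session_id": "...",
    }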

After an event is created, it is sent to a dedicated RabbitMQ queue. Each type
of event has its own queue, so different types of events can be processed
separately. The events in the queue are then consumed and processed by a
Celery task. In this step the flags ``is_machine`` and ``is_robot`` are added,
the user is anonymized and double-clicks are removed. During the data
anonymization process the user-related fields are deleted and replaced by the
``visitor_id`` and the ``unique_session_id``. If an event fails to be
processed (e.g. due to a malformed IP address), the event is skipped/lost and
a warning message is logged. After the processing, the robot events are
discarded to save space, since they are not relevant for the statistics. All
of the other processed events are saved in Elasticsearch.
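
As a sketch of what the anonymization step could look like (the exact hashing
scheme shown here is an assumption, not Zenodo's actual implementation):

.. code-block:: python

    import hashlib

    def anonymize_event(event, salt):
        """Replace user-related fields with anonymous identifiers (sketch)."""
        # Derive a stable per-visitor id from whatever identifies the user.
        visitor = '{0}|{1}|{2}'.format(
            event.get('user_id') or event.get('ip_address'),
            event.get('user_agent'), salt)
        event['visitor_id'] = hashlib.sha224(visitor.encode()).hexdigest()
        # Sessions are bucketed per hour: views/downloads by the same user
        # within the same one-hour window share one unique_session_id.
        hour = event['timestamp'][:13]  # e.g. "2018-01-01T10"
        event['unique_session_id'] = hashlib.sha224(
            '{0}|{1}'.format(visitor, hour).encode()).hexdigest()
        # Drop the personal data now that the anonymous ids are derived.
        for field in ('ip_address', 'user_agent', 'user_id', 'session_id'):
            event.pop(field, None)
        return event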

Aggregations
------------

All the events generated by record views or file downloads are aggregated in
several ways to produce daily statistics. These are the different types of
aggregations used by Zenodo:

* ``record-view-agg``: this aggregation is applied to the ``record-view``
events and it calculates the daily views and unique views of a specific
version of a record;
* ``record-view-all-versions-agg``: this aggregation is applied to the
``record-view`` events and it calculates the daily views and unique views of
all versions of a record;
* ``record-download-agg``: this aggregation is applied to the ``file-download``
events and it calculates the daily downloads and unique downloads of a
specific version of a record;
* ``record-download-all-versions-agg``: this aggregation is applied to the
``file-download`` events and it calculates the daily downloads and unique
downloads of all versions of a record.

Both the ``record-view-agg`` and the ``record-view-all-versions-agg`` are
applied to the same ``record-view`` events, and the aggregate documents they
produce are stored in the same indices/alias (``stats-record-view``). The
difference between the two is that the first aggregates the events by
``recid``, while the second aggregates them by ``conceptrecid``. This leads to
two different results: in the first case we get the statistics for a single
version of a record, in the second the statistics for all the versions of a
record.

For example, let's say that we have the following ``record-view`` events:

.. code-block:: python

    {
        "timestamp": "2018-01-01T10:00:00",
        "recid": "1000",
        "conceptrecid": "999",
        ...
    }
    {
        "timestamp": "2018-01-01T11:00:00",
        "recid": "1000",
        "conceptrecid": "999",
        ...
    }
    {
        "timestamp": "2018-01-01T12:00:00",
        "recid": "1001",
        "conceptrecid": "999",
        ...
    }

The result of the ``record-view-agg`` will be two documents, one for each
version of the record:

.. code-block:: python

    {
        "timestamp": "2018-01-01T00:00:00",
        "recid": "1000",
        "count": 2,
        "unique_count": 2,
        ...
    }
    {
        "timestamp": "2018-01-01T00:00:00",
        "recid": "1001",
        "count": 1,
        "unique_count": 1,
        ...
    }

The result of the ``record-view-all-versions-agg`` will be one document which
summarizes the statistics of both versions of the record:

.. code-block:: python

    {
        "timestamp": "2018-01-01T00:00:00",
        "conceptrecid": "999",
        "count": 3,
        "unique_count": 3,
        ...
    }



The same happens for the ``record-download-agg`` and the
``record-download-all-versions-agg``, which are applied to the
``file-download`` events and end up in the ``stats-file-download``
indices/alias.

To count the total number of unique views (and unique downloads) of a record,
each one-hour user session must be identified with a unique id, the
``unique_session_id``. All the views (and all the downloads) made by the same
user within the same one-hour session share the same ``unique_session_id``.
This way, the total number of unique views (or unique downloads) of a record
is simply the cardinality of the ``unique_session_id`` values present in the
events related to the record.
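
As a sketch (an assumption about the query's shape, not a copy of the
production code), the daily per-record counts could be computed with an
Elasticsearch aggregation like the following, where the unique count is the
cardinality of ``unique_session_id``:

.. code-block:: python

    daily_views_query = {
        "query": {"range": {
            "timestamp": {"gte": "2018-01-01", "lt": "2018-01-02"}}},
        "aggs": {
            "per_record": {
                # One bucket per record version (use "conceptrecid" instead
                # to aggregate over all versions of a record).
                "terms": {"field": "recid"},
                "aggs": {
                    "unique_count": {
                        "cardinality": {"field": "unique_session_id"}}
                }
            }
        }
    }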

All the new aggregations are registered via the ``register_aggregations``
method. The aggregation task runs every hour and takes the events from
Elasticsearch.
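
A rough sketch of what such a registration returns is shown below; the keys
and values are illustrative assumptions based on invenio-stats, not a copy of
Zenodo's code:

.. code-block:: python

    from invenio_stats.aggregations import StatAggregator

    def register_aggregations():
        # Illustrative shape only: one entry per aggregation type.
        return [
            {
                'aggregation_name': 'record-view-agg',
                'templates': 'zenodo.modules.stats.templates.aggregations',
                'aggregator_class': StatAggregator,
                'aggregator_config': {
                    'event': 'record-view',
                    'aggregation_field': 'recid',
                    'aggregation_interval': 'day',
                },
            },
            # ... one similar entry for each of the other aggregations.
        ]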

Queries
-------

Metrics for each ``recid`` and ``conceptrecid`` are aggregated and stored in
"daily" documents. For example, a record with ``recid: 12345`` will have
documents like:

.. code-block:: json

    [
      {
        "_id": "12345-2018-01-01",
        "_index": "stats-record-view-2018-01",
        "_source": {
          "timestamp": "2018-01-01T00:00:00",
          "recid": "12345",
          "record_id": "5bad6b11-84ed-4946-86a9-2b614a63d2b4",
          "communities": ["biosyslit"],
          "count": 20,
          "unique_count": 15
        }
      },
      {
        "_id": "12345-2018-01-02",
        "_index": "stats-record-view-2018-01",
        "_source": {
          "timestamp": "2018-01-02T00:00:00",
          "recid": "12345",
          "record_id": "5bad6b11-84ed-4946-86a9-2b614a63d2b4",
          "communities": ["biosyslit"],
          "count": 40,
          "unique_count": 30
        }
      }
    ]

Although that representation is useful for displaying a histogram, it is
obviously not very convenient for generating yearly or all-time statistics for
a record. Invenio-Stats solves this by allowing preconfigured queries to be
performed against Elasticsearch, which further aggregate metrics over periods
of time by filtering and performing `Metrics Aggregations
<https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-metrics.html>`_.

The configured queries that are defined in
``zenodo.modules.stats.registrations`` are:

- ``record-view``: View statistics for specific record versions.
- ``record-view-all-versions``: View statistics for all versions of a record.
- ``record-download``: Download statistics for specific record versions.
- ``record-download-all-versions``: Download statistics for all versions of a
record.

These queries are exposed via a REST API at ``/api/stats``, accessible only to
users with the ``admin-access`` permission.
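
A hedged sketch of how such a query could be issued over HTTP (the request
body shape follows invenio-stats' stats API; treat the exact keys and the
token placeholder as assumptions):

.. code-block:: python

    import requests

    # "downloads" is an arbitrary label for this query in the request body.
    response = requests.post(
        'https://zenodo.org/api/stats',
        headers={'Authorization': 'Bearer <ACCESS-TOKEN>'},
        json={'downloads': {
            'stat': 'record-download-all-versions',
            'params': {'conceptrecid': '999'},
        }},
    )
    print(response.json())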

Records integration
-------------------

While using Queries is enough to fetch individual record statistics, this is
not an optimal solution for the most common use-cases. Making an Elasticsearch
query every time we want to display the total views, downloads, etc. of a
record and all of its versions puts a lot of strain on Elasticsearch.

Another use-case is sorting records by views in search results. Since there is
no way to perform an SQL-like ``JOIN`` in Elasticsearch that would let us join
the ``records`` index with some aggregation of the ``stats-record-view``
indices (and even if there were, it would not be very efficient), there is
only one solution left: to include the statistics inside the record's indexed
document.

Because of the above use-cases, we introduced a ``_stats`` field in the
``records`` Elasticsearch mapping. Every time a record is indexed (either
through normal or bulk indexing), this field is built by performing the
necessary *sub-queries* to Elasticsearch, in order to fetch the all-time
statistics of the record (a sketch of the field's shape follows the list
below). These are:

- ``views`` & ``unique_views``
- ``downloads`` & ``unique_downloads``
- ``volume`` & ``version_volume``
- ``version_views`` & ``version_unique_views``
- ``version_downloads`` & ``version_unique_downloads``

Now that this pre-calculated information is part of the ``record`` index, we
can use it in the following places:

- For sorting search results (e.g. ``sort: '_stats.version_views'``)
- On the record's page, i.e. in the statistics box in the sidebar
- In the record's REST API responses and other serialization formats

.. note::

This means that rendering a record's page or serializing a single record
now also depends on having both the database and Elasticsearch up and
running to get a complete representation. Since statistics are obviously
not as critical as the actual record's metadata, failure to fetch a record
from Elasticsearch will not raise an exception.

Now that we know how to make the statistics of a record available, we have one
final problem to solve: we need to keep the statistics updated! Although
records are indexed from time to time because of user- or system-initiated
editing/publishing, there has to be a regular updating mechanism that indexes
records that might not have been "touched", but only "viewed" or "downloaded".
The ``zenodo.modules.stats.tasks.update_record_statistics`` Celery task is
responsible for this job. It determines which records' statistics have been
affected by Aggregations by checking the last two *bookmarks* created by each
aggregation. Since these bookmarks have daily granularity, at most 1-2 days'
worth of affected records are sent for bulk indexing each time the task runs.
The task is kicked off multiple times a day by Celery Beat.
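
As a sketch (an assumption, since the actual schedule lives in Zenodo's
configuration), such a task could be scheduled every six hours like this:

.. code-block:: python

    from celery.schedules import crontab

    CELERYBEAT_SCHEDULE = {
        'update-record-statistics': {
            'task': 'zenodo.modules.stats.tasks.update_record_statistics',
            # Run at minute 0 of every sixth hour (00:00, 06:00, 12:00, 18:00).
            'schedule': crontab(minute=0, hour='*/6'),
        },
    }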
