
Implement Elasticsearch for easier analytics #243

Closed
adriaandotcom opened this issue Feb 12, 2020 · 9 comments
Labels: beta (Beta version issues), customer request (Tasks created by customers via our website), drill down (Elasticsearch features!), emailed to customers, enterprise (All tasks related to enterprise customers), prio (Do this first)

Comments

adriaandotcom commented Feb 12, 2020

This summer we are planning to build an Elasticsearch app that we can query for all kinds of analytics metrics. We need it for a few features.

We hope @Jivings will help us!


@adriaandotcom adriaandotcom added customer request Tasks created by customers via our website enterprise All tasks related to enterprise customers labels Feb 12, 2020
@adriaandotcom adriaandotcom self-assigned this Feb 12, 2020
@adriaandotcom adriaandotcom added this to Features & bugs in Public roadmap via automation Feb 12, 2020
@Macro-Jackson commented:

We have quite a bit of experience here. If you have questions we might be able to help out.

@adriaandotcom commented:

Thanks @Macro-Jackson! Will probably ask some questions when we start.


adriaandotcom commented Apr 8, 2020

Schema

A specification of the requirements for this issue.

We have a few data points linked to every page view:

  1. hostname (id linked to websites)
  2. path (string like /product/123)
  3. referrer (string of document.referrer without the protocol and query eg: example.com/page)
  4. session_uuid (uuid of session)
  5. source
    • utm_source, ref, source
    • utm_medium, medium
    • utm_campaign, campaign
    • utm_content
    • utm_term
  6. user agent
    • browser_name (string)
    • browser_version (string? because it can be 1.1.1)
    • os_name (string)
    • os_version (string? because it can be 1.1.1)
  7. device_type (string)
    • desktop
    • tablet
    • mobile
  8. country (2 letter ISO)
  9. unique (boolean)
  10. scrolled* (integer from 0 to 100)
  11. duration* (integer in seconds)
  12. is_robot (boolean)
  13. server_id (string for the server it came from, e.g. 1984.is, leaseweb_1)
  14. script_id (string for the script that was used, e.g. hello.js, custom_sri_v2)

* These fields are filled in when the visitor closes the page, matched via session_uuid.
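The schema above could be expressed as an Elasticsearch index mapping roughly like this. A sketch only: the field types and the `added_at` timestamp field are assumptions, not something specified in this issue.

```javascript
// Hypothetical mapping for the pageview schema above. All types are our
// assumptions; notably the version fields stay strings ("1.1.1" is not
// numeric) and an added_at date field is assumed for time aggregations.
const pageviewMapping = {
  mappings: {
    properties: {
      hostname: { type: "keyword" },
      path: { type: "keyword" },         // e.g. /product/123
      referrer: { type: "keyword" },     // e.g. example.com/page
      session_uuid: { type: "keyword" },
      utm_source: { type: "keyword" },
      utm_medium: { type: "keyword" },
      utm_campaign: { type: "keyword" },
      utm_content: { type: "keyword" },
      utm_term: { type: "keyword" },
      browser_name: { type: "keyword" },
      browser_version: { type: "keyword" },
      os_name: { type: "keyword" },
      os_version: { type: "keyword" },
      device_type: { type: "keyword" },  // desktop | tablet | mobile
      country: { type: "keyword" },      // 2-letter ISO code
      unique: { type: "boolean" },
      scrolled: { type: "integer" },     // 0-100, filled in on page close
      duration: { type: "integer" },     // seconds, filled in on page close
      is_robot: { type: "boolean" },
      server_id: { type: "keyword" },
      script_id: { type: "keyword" },
      added_at: { type: "date" }         // assumed timestamp field
    }
  }
};
```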

Data queries

With this schema we want to get answers on these questions:

  • Visits over time grouped by days/weeks*/years* (unique and non-unique)
  • List of referrers sorted by most visits (unique* and non-unique)
  • List of paths sorted by most visits (unique* and non-unique)
  • List of paths sorted by a formula based on scrolled and duration *
  • List of user agents sorted by most visits (unique* and non-unique)
  • List of countries sorted by most visits (unique* and non-unique)
  • Get the bounce rate for pages

For referrers

  • Visits over time based on a referrer
  • List of paths sorted by most visits based on a referrer *
  • List of paths sorted by a formula based on scrolled and duration based on a referrer *
  • List of user agents sorted by most visits based on a referrer *
  • List of countries sorted by most visits based on a referrer *

For paths

  • Visits over time based on a path
  • List of referrers sorted by most visits based on a path *
  • List of user agents sorted by most visits based on a path *
  • List of countries sorted by most visits based on a path *

And this for user agents, countries, ...

* Questions we can't answer in our current system

All these queries should keep the time zone in mind. If a user in New York requests the data for Monday, the result will differ from a user in Amsterdam requesting the same Monday. In the current app we solved this by aggregating per hour; that way we can always select all records between certain hours (i.e., matching a time zone offset).
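The time zone handling described above maps to a standard Elasticsearch `date_histogram` aggregation, which accepts a `time_zone` option so the day buckets follow the viewer's local midnight. A sketch under assumptions: the `added_at` and `unique` field names are not from this issue.

```javascript
// Hedged sketch: daily visits (total and unique) bucketed in the
// viewer's time zone. "calendar_interval" and "time_zone" are standard
// date_histogram options; field names are assumptions.
const visitsPerDay = (timeZone) => ({
  size: 0, // we only want the aggregation, not the hits
  aggs: {
    visits_over_time: {
      date_histogram: {
        field: "added_at",
        calendar_interval: "day",
        time_zone: timeZone // e.g. "America/New_York", "Europe/Amsterdam"
      },
      aggs: {
        // count only unique visits inside each day bucket
        unique_visits: { filter: { term: { unique: true } } }
      }
    }
  }
});
```

With this, the pre-aggregated hourly buckets from the current app are no longer needed: the same stored events can be re-bucketed per request for any time zone.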

How?

I'm thinking of setting up a separate server with just ES (Elasticsearch) and a Node.js app which will serve as an API. All data will be aggregated on the ES server and sent back to the main server where we display the data.

I think the data model is pretty simple. I would love some help setting up ES, adding the model into ES, getting the data in, setting up all queries needed for the above questions, and connecting those queries with endpoints of the Node.js app on the ES server. Connecting the Node.js app with the main server app is something I would like to do myself.

It would be a nice-to-have if we could add authentication on the ES server as well, so people and our front end can "talk" directly to the ES server.


adriaandotcom commented Apr 17, 2020

Added two utm codes to the source:

  • utm_content
  • utm_term

Sorry if this is a bit late, but better now than after we built it, right?


Jivings commented Apr 19, 2020

Do you have any thoughts about this:

List of paths sorted by a formula based on scrolled and duration based on a referrer *

I'm terming this "most popular paths"; I think that's essentially what it means, yes?

If possible it would be better to calculate this at index time if you have an idea of how it will work.

@adriaandotcom commented:

List of paths sorted by a formula based on scrolled and duration based on a referrer *

It means it will use the two other variables: scrolled and duration. The output will be a list of "best quality pages" or something. The list will live next to the list of "most popular paths".

I can imagine the formula being something like this:

scrolled * Math.min(duration, 300) * 0.5

The list items are comparable to each other, so if the highest score is 349 we convert that to 100% and scale the other items relative to it. So the final list would be something like:

page       score_raw   score_percentage
/          349         100%
/contact   290         83%
/feedback  50          14%

If possible it would be better to calculate this at index time if you have an idea of how it will work.

The scrolled and duration variables are added later, so I'm not sure whether the indexing happens then? The score is also relative to the other list items, so maybe ES should just spit out the score_raw and in the Node.js app we add the score_percentage. The only thing that's important is that we can sort on score_raw. But the formula will change in the future, I'm sure.

So if we can make the sort work on a formula on scrolled and duration without storing the output in ES, that would be great. But if we need to store it, that's also fine. Then we need to update old fields when the formula changes.
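The "sort on a formula without storing the output" option exists in Elasticsearch as a script-based sort: the formula runs at query time in a Painless script, so changing it never requires reindexing. A sketch; the field names are from the schema above, the rest is assumed.

```javascript
// Hedged sketch of a query-time sort using Elasticsearch's _script
// sort with a Painless script. Nothing is stored, so formula changes
// are free; the trade-off is extra work on every query.
const sortByScore = {
  sort: {
    _script: {
      type: "number",
      order: "desc",
      script: {
        lang: "painless",
        // same formula as proposed above, evaluated per document
        source:
          "doc['scrolled'].value * Math.min(doc['duration'].value, 300) * 0.5"
      }
    }
  }
};
```

The stored-score approach discussed next trades this flexibility for performance, which is why reindexing old documents is needed there when the formula changes.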


Jivings commented Apr 20, 2020

Okay I understand the score, and we should definitely store it for performance reasons. We are reindexing that document when the scrolled and duration values come in anyway, so no problem to add the score at that point.

But can you explain why you need the %? Maybe give me an example of what you're trying to show in the UI? Is it:

  1. The top N paths by quality
  2. The top 5th (/25th/50th etc) percentile of paths

If 1 is what you want, then score_percentage is just a way of representing the score on the FE; we can do a regular aggregation to get the top N paths, and also return the total & sum of all the path scores so you can calculate the % at query time.

If you want the percentile (I could see this being useful for finding outliers, like REALLY popular or unpopular paths), then we can support that via a percentile aggregation on the score_raw field.
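The percentile option mentioned here corresponds to Elasticsearch's standard `percentiles` aggregation. A sketch, assuming the score is stored in a `score_raw` field as discussed above:

```javascript
// Hedged sketch: percentile breakdown of stored page-quality scores,
// useful for spotting outlier paths. "percents" picks which
// percentiles to return.
const scorePercentiles = {
  size: 0,
  aggs: {
    score_percentiles: {
      percentiles: {
        field: "score_raw",
        percents: [5, 25, 50, 75, 95]
      }
    }
  }
};
```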


adriaandotcom commented Apr 20, 2020

I think it's closer to 1 for now. Quality would be nice to represent with a percentage, but I'm totally fine doing this in the app itself. Then we just have a score relative to the other pages for that period.

The result in the dashboard would be an indicator (not sure yet how to show this, maybe a green, orange, or red dot) next to the page name in the list. Something simple where people can see: "ah, this page is doing well." It should be simple, after all.

@adriaandotcom adriaandotcom changed the title Implement Elastic Search for easier analytics Implement Elasticsearch for easier analytics Apr 20, 2020
@adriaandotcom adriaandotcom added the drill down Elasticsearch features! label May 3, 2020
@adriaandotcom adriaandotcom added beta Beta version issues prio Do this first labels Aug 26, 2020
@adriaandotcom commented:

Elasticsearch is up and running. Most bugs are in the frontend and we have open issues for those. No need for this issue anymore. 🥳

Public roadmap automation moved this from Features & bugs to Implemented Aug 26, 2020