
Implement Elasticsearch for easier analytics #243

Closed
adriaandotcom opened this issue Feb 12, 2020 · 9 comments
Labels: beta (Beta version issues), customer request (Tasks created by customers via our website), drill down (Elasticsearch features!), emailed to customers, enterprise (All tasks related to enterprise customers), prio (Do this first)

Comments

adriaandotcom commented Feb 12, 2020

This summer we are planning to build an Elasticsearch app that we can query for all kinds of analytics metrics. We need it for a few features.

We hope @Jivings will help us!


@adriaandotcom adriaandotcom added customer request Tasks created by customers via our website enterprise All tasks related to enterprise customers labels Feb 12, 2020
@adriaandotcom adriaandotcom self-assigned this Feb 12, 2020
@adriaandotcom adriaandotcom added this to Features & bugs in Public roadmap via automation Feb 12, 2020
@Macro-Jackson commented:

We have quite a bit of experience here. If you have questions we might be able to help out.

@adriaandotcom commented:

Thanks @Macro-Jackson! Will probably ask some questions when we start.


adriaandotcom commented Apr 8, 2020

Schema

A specification of the requirements for this issue.

We have a few data points linked to every page view:

  1. hostname (id linked to websites)
  2. path (string like /product/123)
  3. referrer (string of document.referrer without the protocol and query eg: example.com/page)
  4. session_uuid (uuid of session)
  5. source
    • utm_source, ref, source
    • utm_medium, medium
    • utm_campaign, campaign
    • utm_content
    • utm_term
  6. user agent
    • browser_name (string)
    • browser_version (string? because it can be 1.1.1)
    • os_name (string)
    • os_version (string? because it can be 1.1.1)
  7. device_type (string)
    • desktop
    • tablet
    • mobile
  8. country (2 letter ISO)
  9. unique (boolean)
  10. scrolled* (integer from 0 to 100)
  11. duration* (integer in seconds)
  12. is_robot (boolean)
  13. server_id (string for the server it came from, e.g. 1984.is, leaseweb_1)
  14. script_id (string for the script that was used, e.g. hello.js, custom_sri_v2)

* These fields are filled in when the visitor closes the page, matched via session_uuid.
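The schema above could be expressed as an Elasticsearch index mapping roughly like this. A sketch only: the field types and the `added_at` timestamp field are assumptions, not something specified in this issue.

```javascript
// Hypothetical mapping for the pageview schema above. All types are our
// assumptions; notably the version fields stay strings ("1.1.1" is not
// numeric) and an added_at date field is assumed for time aggregations.
const pageviewMapping = {
  mappings: {
    properties: {
      hostname: { type: "keyword" },
      path: { type: "keyword" },         // e.g. /product/123
      referrer: { type: "keyword" },     // e.g. example.com/page
      session_uuid: { type: "keyword" },
      utm_source: { type: "keyword" },
      utm_medium: { type: "keyword" },
      utm_campaign: { type: "keyword" },
      utm_content: { type: "keyword" },
      utm_term: { type: "keyword" },
      browser_name: { type: "keyword" },
      browser_version: { type: "keyword" },
      os_name: { type: "keyword" },
      os_version: { type: "keyword" },
      device_type: { type: "keyword" },  // desktop | tablet | mobile
      country: { type: "keyword" },      // 2-letter ISO code
      unique: { type: "boolean" },
      scrolled: { type: "integer" },     // 0-100, filled in on page close
      duration: { type: "integer" },     // seconds, filled in on page close
      is_robot: { type: "boolean" },
      server_id: { type: "keyword" },
      script_id: { type: "keyword" },
      added_at: { type: "date" }         // assumed timestamp field
    }
  }
};
```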

Data queries

With this schema we want to get answers on these questions:

  • Visits over time grouped by days/weeks*/years* (unique and non-unique)
  • List of referrers sorted by most visits (unique* and non-unique)
  • List of paths sorted by most visits (unique* and non-unique)
  • List of paths sorted by a formula based on scrolled and duration *
  • List of user agents sorted by most visits (unique* and non-unique)
  • List of countries sorted by most visits (unique* and non-unique)
  • Get the bounce rate for pages

For referrers

  • Visits over time based on a referrer
  • List of paths sorted by most visits based on a referrer *
  • List of paths sorted by a formula based on scrolled and duration based on a referrer *
  • List of user agents sorted by most visits based on a referrer *
  • List of countries sorted by most visits based on a referrer *

For paths

  • Visits over time based on a path
  • List of referrers sorted by most visits based on a path *
  • List of user agents sorted by most visits based on a path *
  • List of countries sorted by most visits based on a path *

And this for user agents, countries, ...

* Questions we can't answer in our current system

All these queries should keep the time zone in mind. If a user in New York requests the data for Monday, the result will differ from a user in Amsterdam requesting the same Monday. In the current app we solved this by aggregating per hour; that way we can always select all records between certain hours (i.e., matching a time zone offset).
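The time zone handling described above maps to a standard Elasticsearch `date_histogram` aggregation, which accepts a `time_zone` option so the day buckets follow the viewer's local midnight. A sketch under assumptions: the `added_at` and `unique` field names are not from this issue.

```javascript
// Hedged sketch: daily visits (total and unique) bucketed in the
// viewer's time zone. "calendar_interval" and "time_zone" are standard
// date_histogram options; field names are assumptions.
const visitsPerDay = (timeZone) => ({
  size: 0, // we only want the aggregation, not the hits
  aggs: {
    visits_over_time: {
      date_histogram: {
        field: "added_at",
        calendar_interval: "day",
        time_zone: timeZone // e.g. "America/New_York", "Europe/Amsterdam"
      },
      aggs: {
        // count only unique visits inside each day bucket
        unique_visits: { filter: { term: { unique: true } } }
      }
    }
  }
});
```

With this, the pre-aggregated hourly buckets from the current app are no longer needed: the same stored events can be re-bucketed per request for any time zone.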

How?

I'm thinking of setting up a separate server with just ES (Elasticsearch) and a Node.js app which will serve as an API. All data will be aggregated on the ES server and sent back to the main server where we display the data.

I think the data model is pretty simple. I would love some help setting up ES, adding the model into ES, getting the data in, setting up all queries needed for the above questions, and connecting those queries with endpoints of the Node.js app on the ES server. Connecting the Node.js app with the main server app is something I would like to do myself.

It would be a nice-to-have if we could add authentication on the ES server as well, so people and our front end can "talk" directly to the ES server.


adriaandotcom commented Apr 17, 2020

Added two utm codes to the source:

  • utm_content
  • utm_term

Sorry if this is a bit late, but better now than after we built it, right?


Jivings commented Apr 19, 2020

Do you have any thoughts about this:

List of paths sorted by a formula based on scrolled and duration based on a referrer *

I'm terming this "most popular paths"; I think that's essentially what it means, yes?

If possible it would be better to calculate this at index time if you have an idea of how it will work.

@adriaandotcom commented:

List of paths sorted by a formula based on scrolled and duration based on a referrer *

It means it will use the two other variables: scrolled and duration. The output will be a list of "best quality pages" or something. The list will live next to the list of "most popular paths".

I can imagine the formula being something like this:

scrolled * Math.min(duration, 300) * 0.5

The list items are comparable to each other, so if the highest score is 349 we convert that to 100% and scale the other items relative to it. So the final list would be something like:

page       score_raw   score_percentage
/          349         100%
/contact   290         83%
/feedback  50          14%

If possible it would be better to calculate this at index time if you have an idea of how it will work.

The scrolled and duration variables are added later, so I'm not sure whether the indexing happens then? The score is also relative to the other list items, so maybe ES should just spit out the score_raw and in the Node.js app we add the score_percentage. The only thing that's important is that we can sort on score_raw. But the formula will change in the future, I'm sure.

So if we can make the sort work on a formula on scrolled and duration without storing the output in ES, that would be great. But if we need to store it, that's also fine. Then we need to update old fields when the formula changes.
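The "sort on a formula without storing the output" option exists in Elasticsearch as a script-based sort: the formula runs at query time in a Painless script, so changing it never requires reindexing. A sketch; the field names are from the schema above, the rest is assumed.

```javascript
// Hedged sketch of a query-time sort using Elasticsearch's _script
// sort with a Painless script. Nothing is stored, so formula changes
// are free; the trade-off is extra work on every query.
const sortByScore = {
  sort: {
    _script: {
      type: "number",
      order: "desc",
      script: {
        lang: "painless",
        // same formula as proposed above, evaluated per document
        source:
          "doc['scrolled'].value * Math.min(doc['duration'].value, 300) * 0.5"
      }
    }
  }
};
```

The stored-score approach discussed next trades this flexibility for performance, which is why reindexing old documents is needed there when the formula changes.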


Jivings commented Apr 20, 2020

Okay I understand the score, and we should definitely store it for performance reasons. We are reindexing that document when the scrolled and duration values come in anyway, so no problem to add the score at that point.

But can you explain why you need the %? Maybe give me an example of what you're trying to show in the UI? Is it:

  1. The top N paths by quality
  2. The top 5th (/25th/50th etc) percentile of paths

If 1 is what you want, then score_percentage is just a way of representing the score on the FE; we can do a regular aggregation to get the top N paths, and also return the total & sum of all the path scores so you can calculate the % at query time.

If you want the percentile (I could see this being useful for finding outliers, like REALLY popular or unpopular paths), then we can support that via a percentile aggregation on the score_raw field.
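The percentile option mentioned here corresponds to Elasticsearch's standard `percentiles` aggregation. A sketch, assuming the score is stored in a `score_raw` field as discussed above:

```javascript
// Hedged sketch: percentile breakdown of stored page-quality scores,
// useful for spotting outlier paths. "percents" picks which
// percentiles to return.
const scorePercentiles = {
  size: 0,
  aggs: {
    score_percentiles: {
      percentiles: {
        field: "score_raw",
        percents: [5, 25, 50, 75, 95]
      }
    }
  }
};
```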


adriaandotcom commented Apr 20, 2020

I think it's closer to 1 for now. Quality would be nice to represent with a percentage, but I'm totally fine doing this in the app itself. Then we just have a score relative to the other pages for that period.

The result in the dashboard would be an indicator (not sure yet how to show this, maybe a green, orange, or red dot) next to the page name in the list. Something simple where people can see: "ah, this page is doing well." It should be simple, after all.

@adriaandotcom adriaandotcom changed the title Implement Elastic Search for easier analytics Implement Elasticsearch for easier analytics Apr 20, 2020
@adriaandotcom adriaandotcom added the drill down Elasticsearch features! label May 3, 2020
@adriaandotcom adriaandotcom added beta Beta version issues prio Do this first labels Aug 26, 2020
@adriaandotcom commented:

Elasticsearch is up and running. Most bugs are in the frontend and we have open issues for those. No need for this issue anymore. 🥳

Public roadmap automation moved this from Features & bugs to Implemented Aug 26, 2020