Create scheduled job to process retention data #3074

Closed
antross opened this issue Oct 5, 2019 · 3 comments

@antross
Member

antross commented Oct 5, 2019

Splitting out the processing as a separate task from the telemetry collection covered in #3056.

Processing

Because some of the records contain redundant data, structured queries aren't suitable for generating the retention chart directly. Instead, we'll run a daily scheduled web job to convert the records into a form that's easier to query.

As an example, using a rolling 4-day period (where _ marks data outside the 4-day period), the following table shows the "real" user (not included in the actual data) and the corresponding logged activity record. Each record lists the day it was submitted first, followed by the preceding days. The table also shows which records would be ignored because they are redundant with data submitted later.

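| Day | User | Activity Record | Redundant |
|-----|------|-----------------|-----------|
| 1 | A | [1, _, _, _] | X |
| 1 | C | [1, _, _, _] | X |
| 2 | A | [1, 1, _, _] | |
| 2 | C | [1, 1, _, _] | X |
| 2 | D | [1, 0, _, _] | |
| 3 | B | [1, 0, 0, _] | X |
| 4 | A | [1, 0, 1, 1] | |
| 4 | B | [1, 1, 0, 0] | |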

Or alternatively to show how the data aligns across days:

1A          [1, _, _, _] X
1C          [1, _, _, _] X
2A       [1, 1, _, _]
2C       [1, 1, _, _]    X
2D       [1, 0, _, _]
3B    [1, 0, 0, _]       X
4A [1, 0, 1, 1]
4B [1, 1, 0, 0]

Note that marking a record as redundant only means it matches the same usage pattern - it doesn't actually have to originate from the same user. Since record 4A ends in 1, 1, it needs to cancel out a record from day 2 starting with 1, 1 and a record from day 1 starting with 1. In this example the cancelled record from day 2 actually came from user C, but that's okay as balancing the numbers so each day's activity only gets counted once is what matters.
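
One possible way to sketch the cancellation described above, in Python. This is an illustration under assumptions, not the job's actual implementation: the `Record` class, the `mark_redundant` and `example_records` names, and the use of `None` for `_` are all hypothetical, and the `label` field exists only so the result can be compared with the table (real records carry no user identity).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    label: str                     # illustration only; not present in real data
    day: int                       # day the record was submitted
    activity: List[Optional[int]]  # index 0 = submission day, 1 = day before, ...; None = "_"
    redundant: bool = False


def mark_redundant(records: List[Record]) -> None:
    """Mark earlier records whose usage pattern is repeated inside a later record."""
    # Visit the newest records first so each chain starts at its latest record.
    for rec in sorted(records, key=lambda r: r.day, reverse=True):
        # Offset of the most recent earlier day on which this record reports activity.
        offset = next(
            (k for k in range(1, len(rec.activity)) if rec.activity[k] == 1), None
        )
        if offset is None:
            continue
        tail = rec.activity[offset:]
        # Cancel exactly one unmarked record from that day with the same pattern.
        # Which matching record gets picked is arbitrary; scanning in reverse
        # simply happens to reproduce the choice shown in the table.
        for earlier in reversed(records):
            if (
                earlier.day == rec.day - offset
                and not earlier.redundant
                and earlier.activity[: len(tail)] == tail
            ):
                earlier.redundant = True
                break


def example_records() -> List[Record]:
    """The eight records from the example table, in table order."""
    _ = None
    return [
        Record("1A", 1, [1, _, _, _]),
        Record("1C", 1, [1, _, _, _]),
        Record("2A", 2, [1, 1, _, _]),
        Record("2C", 2, [1, 1, _, _]),
        Record("2D", 2, [1, 0, _, _]),
        Record("3B", 3, [1, 0, 0, _]),
        Record("4A", 4, [1, 0, 1, 1]),
        Record("4B", 4, [1, 1, 0, 0]),
    ]
```

Each record cancels only the record from its most recent earlier active day, and that record in turn cancels its own predecessor, so no day's activity is cancelled twice. Run on the example data, this marks 1A, 1C, 2C, and 3B as redundant.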

So to determine how many unique users were active for at least two days in this time period, we simply count how many non-redundant records have at least two 1s within the four-day range. That's 2A, 4A, and 4B for a total of three unique users. The actual users who met this criterion were A, B, and C, but we don't need to know that - only how many of them there were.
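
The counting step could look roughly like the following Python sketch. The `RECORDS` list and the `count_active_users` name are hypothetical; the data is copied from the example table, with `None` standing in for `_` and the redundant flag entered by hand.

```python
from typing import List, Optional, Tuple

_ = None  # stands for a day outside the 4-day period

# (label, activity record, redundant) copied from the example table.
RECORDS: List[Tuple[str, List[Optional[int]], bool]] = [
    ("1A", [1, _, _, _], True),
    ("1C", [1, _, _, _], True),
    ("2A", [1, 1, _, _], False),
    ("2C", [1, 1, _, _], True),
    ("2D", [1, 0, _, _], False),
    ("3B", [1, 0, 0, _], True),
    ("4A", [1, 0, 1, 1], False),
    ("4B", [1, 1, 0, 0], False),
]


def count_active_users(
    records: List[Tuple[str, List[Optional[int]], bool]], min_days: int
) -> int:
    """Count non-redundant records with at least `min_days` active days."""
    return sum(
        1
        for _label, activity, redundant in records
        if not redundant and sum(1 for a in activity if a == 1) >= min_days
    )


print(count_active_users(RECORDS, 2))  # prints 3 (records 2A, 4A, and 4B)
```

The same function with `min_days=1` returns 4 for this data, which matches the four distinct users (A, B, C, and D) who were active at all during the period.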

@antross antross added this to the 1910-1 milestone Oct 5, 2019
@sarvaje
Contributor

sarvaje commented Oct 7, 2019

@antross I'm going to take this one. I will let you know if I have some questions/problems hehehe

@antross
Member Author

antross commented Oct 8, 2019

Great! I was hoping you would 😉

@sarvaje
Contributor

sarvaje commented Oct 21, 2019

This is done in the telemetry repository via 1115e039dfa9bc474c9ce199b1ebd50a634316da

Closing..

@sarvaje sarvaje closed this as completed Oct 21, 2019