Telemetry: prune the amount of data stored #9328

Closed
humitos opened this issue Jun 13, 2022 · 5 comments

Labels
Accepted Accepted issue on our roadmap Improvement Minor improvement to code

Comments

@humitos
Member

humitos commented Jun 13, 2022

When we implemented the telemetry database to save BuildData objects, we didn't implement any pruning for it. After some days/weeks, we experienced growth of 3 GB in data.

We should prune the BuildData objects with some useful logic. We talked about pruning based on:

  1. only creation datetime (e.g. store last 6 months)
  2. spam score (e.g. only save BuildData if spam score is less than 150)
  3. fixed number of objects based on project/version (e.g. save 100 objects per project/version)
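To make the trade-offs concrete, here is a minimal sketch of the three options as plain Python filters. The `Record` class and its field names (`created`, `spam_score`) are hypothetical stand-ins for BuildData rows, not the project's actual model; the thresholds are the example values from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    # Hypothetical stand-in for a BuildData row; field names are assumptions.
    project: str
    version: str
    created: datetime
    spam_score: int


def keep_recent(records, now, max_age_days):
    """Option 1: keep only records created within the last `max_age_days`."""
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in records if r.created >= cutoff]


def keep_low_spam(records, max_score=150):
    """Option 2: keep only records whose spam score is below `max_score`."""
    return [r for r in records if r.spam_score < max_score]


def keep_latest_per_version(records, limit=100):
    """Option 3: keep the newest `limit` records per (project, version)."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.project, r.version)].append(r)
    kept = []
    for group in groups.values():
        # Newest first, then truncate to the per-group limit.
        group.sort(key=lambda r: r.created, reverse=True)
        kept.extend(group[:limit])
    return kept
```

Option 1 bounds the age of the data but not the volume per project, option 3 does the opposite, and option 2 depends on the spam score being available when the data is saved.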

Each of them has its own downsides and we can talk/discuss a good implementation. It would be good if we can keep it simple and store only the data we need for the answers we are looking for.

It's worth noting that we created this issue because a PagerDuty alarm was triggered by the lack of free space in the database.

@humitos humitos added Improvement Minor improvement to code Accepted Accepted issue on our roadmap labels Jun 13, 2022
@ericholscher
Member

ericholscher commented Jun 13, 2022

Yea, I think the big question I have is "what valuable queries are we running against this data?" I think the answer currently is "we aren't really using this data yet", but I think knowing how we think it will be useful is important in making sure we can only keep the data we care about. Alternatively, we could archive it somewhere that isn't stored in postgres (eg. a monthly pg_dump file or csv in S3) so that we can query it if we need to, but we aren't paying to store it in a queryable form.

We do this with our ads data, but we almost never go back and query it, so I'm not sure how useful it is to have archived old data; we can probably just delete it. We archive the ads data because it's billing data, but BuildData is not as important.
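For reference, the archive-then-delete idea could look roughly like the sketch below: split rows by age, serialize the old ones to CSV, and keep only the recent ones in the database. The row shape (dicts with a `created` datetime) and the function names are assumptions for illustration, and the upload step (for example to S3) is deliberately left out.

```python
import csv
import io
from datetime import datetime, timedelta


def split_for_archive(rows, now, keep_days):
    """Split rows into (to_archive, to_keep) by their `created` timestamp."""
    cutoff = now - timedelta(days=keep_days)
    to_archive = [r for r in rows if r["created"] < cutoff]
    to_keep = [r for r in rows if r["created"] >= cutoff]
    return to_archive, to_keep


def rows_to_csv(rows, fieldnames):
    """Serialize rows to CSV text; uploading the result is out of scope here."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        # Store timestamps as ISO 8601 strings so they can be parsed back later.
        writer.writerow({**row, "created": row["created"].isoformat()})
    return buf.getvalue()
```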

@humitos
Member Author

humitos commented Jul 4, 2022

@ericholscher

> Yea, I think the big question I have is "what valuable queries are we running against this data?" I think the answer currently is "we aren't really using this data yet", but I think knowing how we think it will be useful is important in making sure we can only keep the data we care about

These are two real cases where I used it and it was useful:

> Alternatively, we could archive it somewhere that isn't stored in postgres (eg. a monthly pg_dump file or csv in S3) so that we can query it if we need to, but we aren't paying to store it in a queryable form.

I don't think it is worth the effort because I don't think we will come back to very old data. The data we want to query to make decisions shouldn't be too old.

@humitos
Member Author

humitos commented Jul 4, 2022

As a first step, to keep it from growing too much, I'd save X months of data. Then, we probably want to save more data only for "active projects", which are the ones that we care about the most.

As a reference, ~90 days of data is ~8 GB:

docs=> SELECT pg_size_pretty(pg_database_size('telemetry')); 
 pg_size_pretty 
----------------
 8014 MB
(1 row)

Quick math:

  • 6 months of data is 30 GB
  • 12 months of data is 60 GB (about the same size as the docs db)

@ericholscher
Member

Sounds like we should probably just keep the last 90 days for now at a minimum?

@humitos
Member Author

humitos commented Jul 6, 2022

ha ha! I don't know how my math works... "90 days is 8 GB and 6 months is 30 GB" 🤔

> Sounds like we should probably just keep the last 90 days for now at a minimum?

Yes. I'd start with 180 days, which should be ~16 GB, and I think that's acceptable. We can keep tuning it later as we start using this data more.
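The ~16 GB figure follows from a simple linear extrapolation of the measurement earlier in the thread (8014 MB for roughly 90 days). A minimal sketch, assuming growth stays linear, which it may not if build volume changes:

```python
MEASURED_MB = 8014   # from pg_database_size('telemetry') earlier in the thread
MEASURED_DAYS = 90   # approximate age of the data when it was measured


def estimate_size_gb(days):
    """Linearly extrapolate the database size in GB (1 GB = 1024 MB)."""
    mb_per_day = MEASURED_MB / MEASURED_DAYS
    return mb_per_day * days / 1024


# 180 days comes out at a bit under 16 GB; a full year at a bit under 32 GB.
for days in (90, 180, 365):
    print(f"{days} days: ~{estimate_size_gb(days):.1f} GB")
```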

humitos added a commit that referenced this issue Jul 6, 2022
Define a task to delete old `BuildData` older than
`RTD_TELEMETRY_DATA_RETENTION_DAYS`, which is set to 180 days for now. This task
is configured to be run every day at 2AM.

Related #9328
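
The core of such a task is just a cutoff computation plus a delete. The sketch below shows that logic on an in-memory list so it runs on its own; it is not the code from the commit. The row shape (dicts with a `created` datetime) and the function names are assumptions. Only the setting name and its 180-day value come from the commit message.

```python
from datetime import datetime, timedelta, timezone

# Name and value taken from the commit message; in the project it is a setting.
RTD_TELEMETRY_DATA_RETENTION_DAYS = 180


def retention_cutoff(now, retention_days=RTD_TELEMETRY_DATA_RETENTION_DAYS):
    """Return the oldest creation time that is still retained."""
    return now - timedelta(days=retention_days)


def delete_old_build_data(store, now=None):
    """Drop rows older than the cutoff from `store` and return how many were removed.

    `store` is a hypothetical in-memory list of dicts. With an ORM, the same
    step would presumably be a filter on the creation timestamp followed by
    a bulk delete.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = retention_cutoff(now)
    before = len(store)
    # Mutate in place so callers holding a reference see the pruned list.
    store[:] = [row for row in store if row["created"] >= cutoff]
    return before - len(store)
```

A scheduler would then call `delete_old_build_data` once a day (02:00 in the commit message), so each run only has about one day of expired rows to remove.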