Add agent for request to artifacthub.io/api/chartsvc #66

Merged: 2 commits, Aug 3, 2021

Conversation

tuananhnguyen-ct
Contributor

Change the URL from hub.helm.sh to artifacthub.io so we can skip the redirection

Add a user agent to the request to fix #64

It seems only this endpoint requires a user agent.

@sstarcher sstarcher merged commit b62ecef into sstarcher:master Aug 3, 2021
@sstarcher
Owner

Thanks

@tegioz

tegioz commented Sep 10, 2021

Hi @sstarcher @tuananhnguyen-ct!

This is Sergio, from Artifact Hub 👋 I'm so happy I stumbled across this PR 😄

A bit of context first

When the Helm Hub migrated to Artifact Hub, we started receiving a lot of search requests using the legacy search API endpoint. We realized that many of them were constantly (many times per second) searching for the same term, and in many cases for terms that were not even related to the kind of content available in the Hub (like random domains and other stuff). Requests were coming from a lot of different sources, from multiple cloud providers. We thought some software was doing those requests in an automated fashion, and we did our best to protect the service by imposing some rate limits and blocking most of them. Blocking some user-agents in that endpoint was one of the measures taken, which affected some of your users. After the fix in this PR was released, the number of requests that made it to our backend servers started growing again.

To give you an idea of the current situation, we are receiving more than 108 million search requests per day on that endpoint, which is more than 1.25 thousand per second 😅. The vast majority of those requests are being blocked, so the impact on artifacthub.io is minimal. However, we still need to process and serve them from the CDN, which at the moment means a cost of ~US$2.5k per month. Also, if we had to temporarily disable some of that protection, which happened once, we could run into some issues processing that volume of searches. In addition to the search requests, we also receive thousands of requests daily for some packages, like this particular chart, which seems related.

How can we solve this

I realize many of your users may not even be aware that they are running into some issues, because even though the requests are being blocked we continue receiving them. It'd be great if we could work together on optimizing how helm-exporter uses the Artifact Hub API. The goal would be to reduce the number of search and package requests as much as possible.

I'm not very familiar with the helm-exporter code base, but I've just taken a quick look at it and I've noticed some points that could help us start the conversation:

  • When users ask helm-exporter to collect stats periodically, it tries by default to get the latest version from Artifact Hub for each installed chart it detects. If users opt to collect the stats every 5 seconds, you'd be searching AH for each of the charts at that frequency. I suspect this is what leads to our problem when we consider a large number of helm-exporter deployments with many charts installed. Please note that our trackers run every 30 minutes, so all those searches return the same results most of the time.
  • The latest version available is the one returned by the search, so no further requests to the package endpoint should be needed to get all versions available and pick the latest.
  • Artifact Hub provides support for webhook notifications when a new version of a chart is released, without having to poll periodically (not directly related to the problem but I just wanted to let you know).
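
To illustrate the first point, here is a minimal Go sketch of decoupling the scrape interval from the upstream request rate with a small TTL cache. It is not taken from the helm-exporter code base: `versionCache`, `newVersionCache`, and the `lookup` callback (a stand-in for whatever call reaches Artifact Hub) are hypothetical names, and the 30-minute TTL simply mirrors the tracker interval mentioned above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cacheEntry stores one lookup result and when it was obtained.
type cacheEntry struct {
	version string
	fetched time.Time
}

// versionCache memoizes latest-version lookups for a fixed TTL so that the
// scrape interval no longer decides how often the upstream API is called.
type versionCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	lookup  func(chart string) (string, error) // hypothetical upstream call
	entries map[string]cacheEntry
}

func newVersionCache(ttl time.Duration, lookup func(string) (string, error)) *versionCache {
	return &versionCache{ttl: ttl, lookup: lookup, entries: map[string]cacheEntry{}}
}

// Latest returns the cached version while it is younger than the TTL and
// calls the upstream lookup only when the entry is missing or stale.
func (c *versionCache) Latest(chart string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[chart]; ok && time.Since(e.fetched) < c.ttl {
		return e.version, nil
	}
	v, err := c.lookup(chart)
	if err != nil {
		return "", err
	}
	c.entries[chart] = cacheEntry{version: v, fetched: time.Now()}
	return v, nil
}

func main() {
	calls := 0
	cache := newVersionCache(30*time.Minute, func(chart string) (string, error) {
		calls++ // counts how often the (simulated) upstream is hit
		return "1.0.0", nil
	})
	for i := 0; i < 1000; i++ { // simulate 1000 scrapes of the same chart
		if _, err := cache.Latest("example-chart"); err != nil {
			panic(err)
		}
	}
	fmt.Println("upstream calls:", calls)
}
```

With a cache like this, a 5-second scrape interval would still produce at most one upstream request per chart per TTL window.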

We'd be more than happy to provide a specific endpoint for your use case if that helps. We did something similar for Harbor replication some time ago for similar reasons. Maybe we could generate a list of all available packages and the latest version for each of them, as it looks like the end goal of your interaction with the Artifact Hub API is to get the latest version of a given chart. When helm-exporter starts, you could fetch them all once in one request using that endpoint and then search locally as often as you need. You could update this full list periodically, not at the frequency requested by your users but at a predefined one that matches how often it actually gets updated. We could update that dump every hour or every few hours; we can discuss it. This is just an idea, and I'm looking forward to hearing your thoughts about it or any other alternatives you can think of.

I understand it's not your responsibility how users use helm-exporter, but we'd really appreciate it if we could work on this together to find a solution that benefits both your users and artifacthub.io users 🙂

Thank you very much in advance for your time and help!

CC: @caniszczyk @mattfarina

@sstarcher
Owner

@tegioz that all makes sense to me. I'm very sorry I have caused you some headache. I no longer actively work on this project, but due to this causing you some headache I would be happy to take some time out and assist.

Let me know if you think it's reasonable for you to develop a separate endpoint.

@tegioz

tegioz commented Sep 10, 2021

Thank you for getting back to me so quickly @sstarcher!

No worries, I'm happy we may have found the possible cause and I really appreciate your offer to help 🙂

Adding the endpoint suggested wouldn't be a problem at all. Let's summarize to check it all makes sense and we'll start working on our part as soon as possible:

  • We would generate a dump of all available packages that could be fetched using a new endpoint. That dump could be generated every hour or every few hours.
  • helm-exporter could fetch this dump on start to search locally using it as often as needed. The local copy of the dump could be updated calling the new endpoint periodically at the agreed frequency. It could be helpful to cache the dump on a file and skip the update on start if it's still valid, just in case helm-exporter is launched manually multiple times, but not sure if it's a common pattern or really needed.
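
The file-cache idea in the second bullet could look roughly like the Go sketch below. It is only an illustration under stated assumptions: `loadDump`, the cache path, and the `fetch` callback (a stand-in for the HTTP request to the new endpoint) are hypothetical, and the 6-hour validity used in `main` is a placeholder for whatever frequency is agreed.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// loadDump returns the cached dump when the file at path is younger than
// maxAge. Otherwise it calls fetch and rewrites the cache file. The boolean
// result reports whether the data came from the cache.
func loadDump(path string, maxAge time.Duration, fetch func() ([]byte, error)) ([]byte, bool, error) {
	if info, err := os.Stat(path); err == nil && time.Since(info.ModTime()) < maxAge {
		if data, err := os.ReadFile(path); err == nil {
			return data, true, nil
		}
	}
	data, err := fetch()
	if err != nil {
		return nil, false, err
	}
	// Write to a temporary file first so a crash cannot leave a truncated cache.
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return nil, false, err
	}
	if err := os.Rename(tmp, path); err != nil {
		return nil, false, err
	}
	return data, false, nil
}

func main() {
	dir, err := os.MkdirTemp("", "dump-cache")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "dump.json")

	fetch := func() ([]byte, error) { return []byte(`[]`), nil } // simulated download

	// Two consecutive starts: the first downloads, the second reuses the file.
	for i := 0; i < 2; i++ {
		_, fromCache, err := loadDump(path, 6*time.Hour, fetch)
		if err != nil {
			panic(err)
		}
		fmt.Println("served from cache:", fromCache)
	}
}
```

Using the file's modification time as the validity check keeps the sketch short; a real implementation might prefer to store a timestamp alongside the data.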

Some questions:

  • Frequency of the dump: how fast do you need to realize that a new version is available? We check repos for updates every 30 mins, but most of the time there won't be updates for a given one so frequently. Would something like 6 or 12 hours be fine?

  • Format of the dump: would something like the snippet below work? We can add some extra fields or remove the ones you think you won't need. We can also use a format other than JSON if it makes searching easier. The combination of repo URL and chart name could yield even more accurate results when searching locally than what you are getting through the search API.

[
  {
    "name": "chart name",
    "latest_version": "1.0.0",
    "repo": {
      "name": "repo name",
      "url": "repo url"
    }
  }
]
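
For local searching, a dump in the proposed shape could be decoded and indexed by repo URL plus chart name. The Go sketch below assumes exactly the field names in the snippet above (the final endpoint's format may differ); the type names, `buildIndex`, and the sample data are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// repo and pkg mirror the JSON shape proposed in the thread.
type repo struct {
	Name string `json:"name"`
	URL  string `json:"url"`
}

type pkg struct {
	Name          string `json:"name"`
	LatestVersion string `json:"latest_version"`
	Repo          repo   `json:"repo"`
}

// chartKey combines the repo URL and the chart name, so charts with the
// same name in different repositories do not collide.
type chartKey struct {
	RepoURL string
	Name    string
}

// buildIndex decodes the dump and maps each (repo URL, chart name) pair to
// its latest version for constant-time local lookups.
func buildIndex(dump []byte) (map[chartKey]string, error) {
	var pkgs []pkg
	if err := json.Unmarshal(dump, &pkgs); err != nil {
		return nil, err
	}
	idx := make(map[chartKey]string, len(pkgs))
	for _, p := range pkgs {
		idx[chartKey{RepoURL: p.Repo.URL, Name: p.Name}] = p.LatestVersion
	}
	return idx, nil
}

func main() {
	// Made-up sample data in the proposed format.
	dump := []byte(`[
	  {"name": "demo", "latest_version": "1.2.3",
	   "repo": {"name": "example", "url": "https://example.invalid/charts"}}
	]`)
	idx, err := buildIndex(dump)
	if err != nil {
		panic(err)
	}
	v, ok := idx[chartKey{RepoURL: "https://example.invalid/charts", Name: "demo"}]
	fmt.Println(v, ok)
}
```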

Thanks again!

@sstarcher
Owner

Have you done an analysis on the user agent? Anyone running this should have a golang user agent. That would give you more of an idea of this project and other golang projects being the cause. We are using req.Header.Set("User-Agent", "Go-http-client/1.1") currently it looks like.

One thing we should probably do is have this project set its own user agent so it can be identified easier in the future.
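
A project-specific user agent can be set with the standard `net/http` package, as in the minimal Go sketch below. The `userAgent` value and the `newRequest` helper are hypothetical placeholders, not what helm-exporter actually sends, and the URL is a dummy that is never contacted.

```go
package main

import (
	"fmt"
	"net/http"
)

// userAgent is a hypothetical identifier; the name and version are
// placeholders chosen for this example.
const userAgent = "helm-exporter/0.0.0-example"

// newRequest builds a GET request that identifies the client explicitly
// instead of relying on Go's default "Go-http-client/..." value.
func newRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

func main() {
	// The request is only constructed here, not sent.
	req, err := newRequest("https://example.invalid/api")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("User-Agent"))
}
```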

I can't speak for all users, but I think a 6 hour rate would be reasonable. I do agree to have the chart name and the repo URL both would be very helpful.

Do you know the overall size of the data structure if we were to have it in json with the above info? I don't want to put a ton of burden on the client end, but if it's something we can easily store in a few mb of memory that would make searching on the client easy.

@tegioz

tegioz commented Sep 10, 2021

> Have you done an analysis on the user agent? Anyone running this should have a golang user agent. That would give you more of an idea of this project and other golang projects being the cause. We are using req.Header.Set("User-Agent", "Go-http-client/1.1") currently it looks like.

Most of the requests have that user-agent; that's actually one of the filters we have in place. It was set to Go-http-client/2.0 before (the default before you started setting it explicitly), and I've updated it today to Go-http-client/1.1 when I noticed the growth and looked into it. Unfortunately, we had to opt for such a strict measure. We are also applying a rate limit of 100 searches every 5 minutes per IP (for other user agents). The other expected usage of that legacy endpoint is the Helm CLI, and they use a specific user-agent (Helm/...).

> One thing we should probably do is have this project set its own user agent so it can be identified easier in the future.

That sounds like a great idea! We need to keep in mind that there will probably be older versions out there for a while.

> I can't speak for all users, but I think a 6 hour rate would be reasonable. I do agree to have the chart name and the repo URL both would be very helpful.

Awesome, we'll do that then 👍

> Do you know the overall size of the data structure if we were to have it in json with the above info? I don't want to put a ton of burden on the client end, but if it's something we can easily store in a few mb of memory that would make searching on the client easy.

I'd say it should be around 1MB as of today, but I can't tell with more precision right now.

I think we have enough to start implementing the new endpoint. If you can think of something else, please don't hesitate to ping me 🙂

tegioz added a commit to artifacthub/hub that referenced this pull request Sep 13, 2021
Related to: sstarcher/helm-exporter#66

Signed-off-by: Sergio Castaño Arteaga <tegioz@icloud.com>
Signed-off-by: Cintia Sanchez Garcia <cynthiasg@icloud.com>
Co-authored-by: Sergio Castaño Arteaga <tegioz@icloud.com>
Co-authored-by: Cintia Sanchez Garcia <cynthiasg@icloud.com>
@tegioz

tegioz commented Sep 13, 2021

New endpoint is ready @sstarcher 🙂

https://artifacthub.io/api/v1/helm-exporter

Response size is 715KB as of today. It'll be cached for one hour, so helm-exporter could check for updates every hour. Please let us know if something doesn't work as expected or you need anything else.
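
Since the response is cached for one hour, a client could keep one in-memory copy and refresh it in the background at that interval. The Go sketch below shows one possible shape under stated assumptions: `dumpStore`, its methods, and `errUnavailable` are hypothetical, and `fetch` stands in for a GET request to the endpoint above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errUnavailable simulates a failed download in this example.
var errUnavailable = errors.New("upstream unavailable")

// dumpStore holds the most recent copy of the dump for local searches.
type dumpStore struct {
	mu    sync.RWMutex
	data  []byte
	fetch func() ([]byte, error) // stand-in for the HTTP request
}

// refresh replaces the in-memory copy. On error the previous copy is kept.
func (s *dumpStore) refresh() error {
	data, err := s.fetch()
	if err != nil {
		return err
	}
	s.mu.Lock()
	s.data = data
	s.mu.Unlock()
	return nil
}

// current returns the copy that local lookups should use.
func (s *dumpStore) current() []byte {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.data
}

// run refreshes once immediately, then once per interval until stop is closed.
func (s *dumpStore) run(stop <-chan struct{}, interval time.Duration) {
	_ = s.refresh()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			_ = s.refresh() // keep serving the old copy on failure
		}
	}
}

func main() {
	s := &dumpStore{fetch: func() ([]byte, error) { return []byte(`[]`), nil }}
	if err := s.refresh(); err != nil {
		panic(err)
	}
	// Simulate an outage: the refresh fails but the old copy stays available.
	s.fetch = func() ([]byte, error) { return nil, errUnavailable }
	err := s.refresh()
	fmt.Println("refresh error:", err, "cached bytes:", len(s.current()))
	// A long-running process would instead start: go s.run(stop, time.Hour)
}
```

Because scrapes read only from `current()`, the number of requests to the endpoint depends on the refresh interval rather than on the scrape frequency or the number of installed charts.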

Thanks!

@tegioz

tegioz commented Sep 13, 2021

@sstarcher
Owner

@tegioz awesome, thanks. I can likely make this update this upcoming weekend.

@tegioz

tegioz commented Sep 13, 2021

Awesome, thanks!

@sstarcher
Owner

@tegioz please see - #67

@tegioz

tegioz commented Sep 27, 2021

Thanks for taking care of this @sstarcher! I will take a look at it shortly 👍

Development

Successfully merging this pull request may close these issues.

Failed to search chart info
3 participants