
Migrate docs.rs to RDS and ECS #353

Open · 10 tasks
jdno opened this issue Oct 13, 2023 · 11 comments

jdno (Member) commented Oct 13, 2023

  • Merge open pull request for Packer
  • Build Docker image for watcher
  • Deploy watcher to ECS (only a single instance at a time!)
  • Figure out how to deploy all components together
    • Web server
    • Watcher
    • Build server
  • Test auto-scaling of builder on staging
  • Test deployments on staging
  • Configure AWS WAF to block IPs

Questions

  • How can the docs.rs team run one-off commands?
  • How are database migrations run as part of the deploy process?
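
One pattern that could answer both questions (nothing here is decided, just a sketch) is to run a standalone ECS task from the existing task definition with the command overridden, either manually or as a step in the deploy pipeline. A minimal boto3 sketch; the cluster, task definition, network IDs and container name are placeholders, and the migration command is only an assumption:

```python
# Sketch: run a one-off command (e.g. database migrations) as a standalone ECS task.
# All identifiers below are placeholders, not the actual docs.rs infrastructure.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

response = ecs.run_task(
    cluster="docs-rs",                         # placeholder cluster name
    taskDefinition="docs-rs-web",              # reuse the existing task definition
    launchType="FARGATE",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],       # placeholder
            "securityGroups": ["sg-0123456789abcdef0"],    # placeholder
            "assignPublicIp": "DISABLED",
        }
    },
    overrides={
        "containerOverrides": [
            {
                "name": "web",                             # placeholder container name
                # hypothetical one-off command, e.g. running pending migrations
                "command": ["cratesfyi", "database", "migrate"],
            }
        ]
    },
)
print(response["tasks"][0]["taskArn"])
```

The same mechanism would cover ad-hoc admin commands; whether migrations run automatically before each deploy or are triggered manually is a separate decision.
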
syphar (Member) commented Oct 14, 2023

Another thing we need to figure out:

  • how would the database migrations be run in the deploy process?

syphar (Member) commented Oct 14, 2023

After checking our NGINX config, there is a second piece we need to solve somehow:

IP blocks.

Every now and then we have a misbehaving crawler, and in those cases we have blocked the source IP in NGINX on our server.

I would prefer to handle this in AWS / CloudFront if possible.

Otherwise we would add it to our web container, probably configured via an environment variable?
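
If it does end up in the web container, the simplest version of that idea is a deny list read from an environment variable and checked per request. This is only an illustration of the logic (docs.rs itself is Rust, and the BLOCKED_IPS variable name is made up):

```python
# Sketch only: deny list from a comma-separated BLOCKED_IPS environment variable
# (hypothetical name), checked against each request's source IP.
import os
from ipaddress import ip_address, ip_network

# e.g. BLOCKED_IPS="203.0.113.7,198.51.100.0/24"
BLOCKED = [
    ip_network(entry.strip())
    for entry in os.environ.get("BLOCKED_IPS", "").split(",")
    if entry.strip()
]

def is_blocked(remote_addr: str) -> bool:
    """Return True if the request's source IP falls inside any blocked network."""
    addr = ip_address(remote_addr)
    return any(addr in net for net in BLOCKED)

# usage in a request handler (pseudo): if is_blocked(client_ip): return 403
```

One caveat with doing it in the container: behind CloudFront the TCP source address is CloudFront's, so the check would have to trust a forwarded-for header, which is another argument for blocking at the edge instead.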

syphar (Member) commented Oct 14, 2023

The next piece we need before prod:

Access to logs.

jdno (Member, Author) commented Oct 16, 2023

For blocking IPs, we should just set up a web-application firewall (AWS WAF). I actually think that we already have one set up for docs.rs, but I'm not 100% sure.
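
For reference, blocking with WAF usually means maintaining an IP set that a block rule in the web ACL references, so adding an offender becomes a small API call rather than an NGINX edit. A boto3 sketch, assuming the WAF is attached to the CloudFront distribution; the IP set name and ID are placeholders:

```python
# Sketch: add an address to an existing AWS WAF IP set referenced by a block rule.
# Name/Id are placeholders; the CLOUDFRONT scope requires the us-east-1 region.
import boto3

wafv2 = boto3.client("wafv2", region_name="us-east-1")

def block_ip(cidr: str, name: str = "docs-rs-blocked-ips", ip_set_id: str = "REPLACE-ME") -> None:
    current = wafv2.get_ip_set(Name=name, Scope="CLOUDFRONT", Id=ip_set_id)
    addresses = set(current["IPSet"]["Addresses"])
    addresses.add(cidr)                       # WAF expects CIDR notation, e.g. "203.0.113.7/32"
    wafv2.update_ip_set(
        Name=name,
        Scope="CLOUDFRONT",
        Id=ip_set_id,
        Addresses=sorted(addresses),
        LockToken=current["LockToken"],       # optimistic-locking token from the read
    )

block_ip("203.0.113.7/32")
```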

Access to the logs is a good point! It probably makes sense to stream all logs to a central place, whether that's CloudWatch or an external tool like Datadog.

jdno self-assigned this Oct 16, 2023
meysam81 (Contributor) commented

@jdno Please let me know if you need a hand with any of the items in this list 🙂

syphar (Member) commented Dec 1, 2023

@jdno Coming from this discussion, I want to add that the docs.rs containers / servers should not be reachable directly from the internet, so all traffic needs to go through CloudFront & AWS WAF.

syphar (Member) commented Mar 1, 2024

One thought I had while thinking about this topic again:

  • CloudFront has a hard limit on in-progress wildcard path invalidations (15)
  • we are invalidating the crate docs after each build

from rust-lang/docs.rs#1871 (comment)

Looking at https://docs.rs/releases/activity it seems we average at least 600 releases per day. If an average invalidation takes 5 minutes and we can have 15 in parallel, that's 3 invalidations per minute throughput. With 1440 minutes in a day, we could handle up to 4320 builds per day before we wind up in unbounded growth land. Of course, that's based on a significant assumption about how long an invalidation takes.
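
The arithmetic behind that estimate, with the 5-minute invalidation time being the big assumption:

```python
# Back-of-the-envelope capacity for wildcard invalidations (assumption-heavy).
concurrent_limit = 15          # CloudFront in-progress wildcard invalidation limit
minutes_per_invalidation = 5   # assumed average completion time
releases_per_day = 600         # rough average from https://docs.rs/releases/activity

throughput_per_minute = concurrent_limit / minutes_per_invalidation   # 3.0
capacity_per_day = throughput_per_minute * 60 * 24                    # 4320.0
headroom = capacity_per_day / releases_per_day                        # 7.2x

print(throughput_per_minute, capacity_per_day, headroom)
```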

I'm not sure if we can / should handle invalidations differently, but we might think about using Fastly when we rework the infra?

Mark-Simulacrum (Member) commented

Can't we de-duplicate invalidations if we approach the limit? E.g., a * invalidation every 5 minutes would presumably never hit the limit. Not sure how that would affect cache hit rates, but I'd expect designing around not needing invalidations or being ok with fairly blanket invalidations to be a good long-term strategy.

(I think we've had this conversation elsewhere before).
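
A rough sketch of that de-duplication / escalation idea: collect pending paths on an interval and, once the batch gets too large, fall back to a single /* invalidation. The distribution ID and threshold below are made up:

```python
# Sketch: periodically flush queued invalidation paths; if the batch is large,
# escalate to one blanket "/*" invalidation instead of many per-crate wildcards.
import time
import boto3

cloudfront = boto3.client("cloudfront")
DISTRIBUTION_ID = "E1234567890ABC"   # placeholder
ESCALATION_THRESHOLD = 15            # made-up cutoff, roughly the concurrency limit

def flush_invalidations(pending_paths: list[str]) -> None:
    if not pending_paths:
        return
    if len(pending_paths) > ESCALATION_THRESHOLD:
        paths = ["/*"]                 # one full purge instead of many wildcards
    else:
        paths = pending_paths          # e.g. ["/serde/*", "/tokio/*"]
    cloudfront.create_invalidation(
        DistributionId=DISTRIBUTION_ID,
        InvalidationBatch={
            "Paths": {"Quantity": len(paths), "Items": paths},
            "CallerReference": str(time.time()),   # must be unique per request
        },
    )
```

Whether this is acceptable depends on how often the full purge would trigger, which is exactly the trade-off discussed below.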

syphar (Member) commented Mar 1, 2024

Can't we de-duplicate invalidations if we approach the limit? E.g., a * invalidation every 5 minutes would presumably never hit the limit.

You mean "escalating" them, so when the queue is too long, just convert the queue into a full purge.
This would definitely work, but it would mean that the user experience (especially outside the US) is worse until the cache has filled up again. Of course, this might be acceptable for us.

being ok with fairly blanket invalidations

This also means that the backend always has to be able to handle the full uncached load, and it means higher egress costs depending on how often we have to do the full purge.

I also remember a discussion at EuroRust about having additional docs.rs web servers (also a read-only DB & local bucket?) in some regions (Europe?).

I'd expect designing around not needing invalidations

You're right, this is a valid discussion to have. I imagine this would only work if the URLs included something like the build number, and the more generic URLs were replaced with redirects. If I'm not missing something, this would revert some of the SEO & URL work from rust-lang/docs.rs#1438 (introducing /latest/ URLs). And then people would start linking to specific doc builds on their sites, as they did before we had /latest/.

(I think we've had this conversation elsewhere before).

you're probably right :)

I wanted to bring it up here as a point for when we migrate the infra anyway.

Mark-Simulacrum (Member) commented

Note that (IMO) if we can get the cache keys set up right, i.e. everything except HTML is always at a by-hash file path - it seems to me that /latest/ can just be served with a short TTL (5 minutes), perhaps with stale-while-revalidate. That means there's a small period where it's not necessarily consistent which version you get across all pages, if some are cached locally and some aren't (and likewise for the CDN), but I don't see any real problem with that. Users mostly won't even notice.
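
To make that split concrete: something like the following cache-control decision, where the path patterns and header values are illustrative assumptions rather than the current docs.rs behavior:

```python
# Sketch of the proposed caching split; path patterns and values are illustrative only.
def cache_control_for(path: str) -> str:
    if path.endswith((".css", ".js", ".woff2", ".svg")):
        # by-hash assets never change for a given URL, so cache them "forever"
        return "public, max-age=31536000, immutable"
    if "/latest/" in path:
        # short TTL; keep serving the stale copy while revalidating in the background
        return "public, max-age=300, stale-while-revalidate=60"
    # versioned docs pages: content is fixed per release, so a longer TTL is safe
    return "public, max-age=86400"

assert cache_control_for("/serde/latest/serde/") == "public, max-age=300, stale-while-revalidate=60"
```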

Yes, anything out of S3 especially can be replicated into multiple regions pretty easily if we need it to be. This just causes issues while you still need invalidations, since you're racing against replication, which can itself take some time (hours IIRC for the cheap option and minutes for the costly one?)

syphar (Member) commented Mar 7, 2024

Note that (IMO) if we can get the cache keys set up right, i.e. everything except HTML is always at a by-hash file path - it seems to me that /latest/ can just be served with a short TTL (5 minutes), perhaps with stale-while-revalidate. That means there's a small period where it's not necessarily consistent which version you get across all pages, if some are cached locally and some aren't (and likewise for the CDN), but I don't see any real problem with that. Users mostly won't even notice.

Jep, everything except HTML should already have hashed filenames, with some small exceptions.
For HTML I (personally) would prefer a longer caching duration; 5 minutes of staleness is probably fine, but I'm not sure how much that would reduce user happiness for some crates. I'll probably try to get better data on the cache coverage for certain crates at some point and see in more detail what the impact on users would be. And it might also be the case that it's just me who needs these kinds of response times for docs :)

Yes, anything out of S3 especially can be replicated into multiple regions pretty easily if we need it to be. This just causes issues while you still need invalidations, since you're racing against replication, which can itself take some time (hours IIRC for the cheap option and minutes for the costly one?)

That's good to know, thanks!
