feat: use Squid proxy for communication with Vercel API #85
…xy implementation
This reverts commit 23930ae.
I'd appreciate it if @gadomski could look at this. :)
Sorry, I accidentally "assigned" you both rather than requesting your reviews 🙃
I should state that the dev proxy is currently running with traffic going through the Squid proxy. I can change the EC2 instance's security group and suddenly endpoints like https://data.dev.source.coop/nasa/?delimiter=%2F&list-type=2&prefix=floods%2F stop working, which gives me confidence that the traffic is indeed going through the proxy.
gadomski
left a comment
Just a few light notes, otherwise makes sense. I'm taking @alukach at his word that this works in the real world, I didn't try to deploy it or anything.
@gadomski Okay, I think we're good to go now.
I will be transparent that while I did test that the data does go through the proxy, I did not test that this actually stabilizes the IP address that Vercel sees. However, that seems like a reasonable assumption, no?
let source_api = web::Data::new(SourceApi::new(source_api_url));
let source_api_url = env::var("SOURCE_API_URL").expect("SOURCE_API_URL must be set");
let proxy_url = env::var("SOURCE_API_PROXY_URL").ok(); // Optional proxy for the Source API
let source_api = web::Data::new(SourceApi::new(source_api_url, proxy_url));
This feels much more explicit, thanks!
🤖 I have created a release *beep* *boop*

---

## [1.0.0](v0.1.29...v1.0.0) (2025-08-21)

### ⚠ BREAKING CHANGES

* update to accommodate Product in S2 API

### Features

* add headers to requests to source API ([#81](#81)) ([edda62f](edda62f))
* update to accommodate Product in S2 API ([be44f43](be44f43))
* use Squid proxy for communication with Vercel API ([#85](#85)) ([25438c3](25438c3))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: source-coop-release[bot] <187876225+source-coop-release[bot]@users.noreply.github.com>
What I'm changing
Problem
Vercel is designed to return Firewall Challenges on requests when it suspects that a client may be attempting a DoS attack. Unfortunately, our Data Proxy makes many requests to our Vercel-hosted Source API, occasionally triggering Vercel's DoS prevention logic. When this occurs, Vercel returns a `403` with an HTML page presenting a button for a human user to press. Of course, our Data Proxy is expecting JSON data and throws a parse error when it receives this response. I believe this is currently the largest issue surrounding the data proxy's reliability: end users receive intermittent 403 responses from our Data Proxy.
It's worth noting that our data proxy never explicitly returns 403 responses:
data.source.coop/src/utils/errors.rs
Lines 106 to 126 in 571eb94
However, it does pass through the response codes from the Source API:
data.source.coop/src/utils/errors.rs
Lines 102 to 104 in 571eb94
This means that any 403 response coming from our Data Proxy is a 403 response from our Source API (almost surely a firewall challenge when it occurs intermittently).
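To make the failure mode above concrete, here is a minimal sketch of how an upstream response could be classified before any JSON parsing is attempted. The `Upstream` enum and `classify` function are hypothetical names, not code from this repository. The sketch also assumes the challenge page is served with a `text/html` content type, which the description implies but does not state.

```rust
/// Hypothetical classification of a Source API response (illustrative only).
#[derive(Debug, PartialEq)]
enum Upstream {
    /// A 2xx response that should carry the expected JSON body.
    Success,
    /// A 403 with an HTML body: most likely a firewall challenge, not a real denial.
    LikelyFirewallChallenge,
    /// Any other status, passed through to the caller unchanged.
    Error(u16),
}

/// Classify a response by status code and `Content-Type` header value.
/// Assumption: the challenge page is served as `text/html`.
fn classify(status: u16, content_type: &str) -> Upstream {
    let is_html = content_type.to_ascii_lowercase().starts_with("text/html");
    match status {
        403 if is_html => Upstream::LikelyFirewallChallenge,
        200..=299 => Upstream::Success,
        code => Upstream::Error(code),
    }
}

fn main() {
    println!("{:?}", classify(403, "text/html; charset=utf-8"));
    println!("{:?}", classify(403, "application/json"));
    println!("{:?}", classify(200, "application/json"));
}
```

A check like this would only improve the error message. It does not stop the challenge from being issued, which is what the bypass rule described below is for.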
Solution
We can work around this by setting a WAF System Bypass Rule within Vercel, instructing it that traffic from our Data Proxy should be exempt from firewall restrictions. We can do this by providing the IP address of our source data proxy. However, this is challenging because our data proxy runs in ECS tasks on AWS Fargate, which have ephemeral IP addresses. To resolve this, I considered solutions like using a NAT Gateway for outbound traffic; however, that would add $0.045/GB to the existing $0.09/GB egress charge (along with an hourly rate), which would be substantial given the function of the data proxy (i.e., serving lots of data). Instead, I believe the simplest solution is to send only our API traffic through an in-network proxy running on EC2 with a stable IP address. Most request clients (such as `reqwest`) have built-in support for using such a proxy.
How I did it
This PR:
* … `vercel-api-${stage}.internal`. This Squid proxy has a security group allowing traffic only from `172.31.0.0/16` (i.e., internal-only traffic).
* … `SourceApi` struct to use a helper method `build_req_client()` that generates a `reqwest` client, optionally configuring it to use a proxy if the `PROXY_URL` environment variable is set.
Note
Currently, the step to provide the `PROXY_URL` to the ECS Task Definition is manual.
Along the way...
* … `source_api_headers()`, tucking it into a method on the `SourceApi` struct alongside the new `build_req_client()` method. I think this makes for slightly tidier code organization.
How to test it
https://github.com/source-cooperative/data.source.coop/actions/runs/16784185577
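As a rough illustration of the optional-proxy behaviour described under "How I did it", the sketch below uses only the standard library and models the routing decision rather than building a real HTTP client. `ProxyMode` and `proxy_mode_from` are hypothetical names; the body of the PR's actual `build_req_client()` helper is not shown on this page. The variable name `SOURCE_API_PROXY_URL` is taken from the diff in the review thread. With `reqwest`, the `ViaProxy` case would correspond to passing the URL to `reqwest::Proxy::all` and attaching the result to the client builder.

```rust
use std::env;

/// Hypothetical model of the client's routing choice (not the PR's actual type).
#[derive(Debug, PartialEq)]
enum ProxyMode {
    /// No proxy configured: requests go straight to the Source API.
    Direct,
    /// Requests are sent through the proxy at the given URL.
    ViaProxy(String),
}

/// Treat a missing or blank value as "no proxy" (the blank-value rule is an assumption).
fn proxy_mode_from(value: Option<String>) -> ProxyMode {
    match value.map(|v| v.trim().to_string()) {
        Some(url) if !url.is_empty() => ProxyMode::ViaProxy(url),
        _ => ProxyMode::Direct,
    }
}

fn main() {
    // Same lookup style as the diff: `.ok()` turns an unset variable into `None`.
    let mode = proxy_mode_from(env::var("SOURCE_API_PROXY_URL").ok());
    match mode {
        ProxyMode::Direct => println!("Source API requests go direct"),
        ProxyMode::ViaProxy(url) => println!("Source API requests go via {url}"),
    }
}
```

Keeping the proxy optional means local development and tests can run without a Squid instance, while deployed tasks opt in by setting the variable.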
PR Checklist
and I have opened issue/PR #XXX to track the change.
Related Issues