This repository has been archived by the owner on Jan 9, 2021. It is now read-only.

some www.galaxyzoo.org urls contain '+' chars which don't resolve via s3 server #82

Closed
camallen opened this issue Aug 20, 2020 · 5 comments

@camallen
Contributor

related to #81 and #64

A GZ subject thumbnail URL like www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg
will redirect to an s3 URL via the nginx static rewrite at https://github.com/zooniverse/static/blob/fe42d006be275b5e59e6e584e67fbeff500f426a/sites/www.galaxyzoo.org.conf#L10

E.g. the above subject URL redirects to an s3 URL that contains the literal '+'.
This doesn't resolve:
https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg
This works (the '+' is encoded as %2B):
https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg

I believe this will be the same in azure land (needs testing)
https://docs.microsoft.com/en-us/rest/api/storageservices/naming-and-referencing-containers--blobs--and-metadata#blob-names

I haven't found a decent way to encode the URL in nginx (which strikes me as very strange) and I need to test how these '+' symbols in URLs work in azure as well.

We may need to encode these URLs explicitly before publishing them to ensure they work as we expect. TBD
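
A minimal sketch of what that explicit encoding step could look like, in Python. The helper name `encode_subject_url` is hypothetical and not taken from any Zooniverse code. It assumes a '+' in the path is always a literal plus sign (never an encoded space), and it decodes before encoding so that an already-encoded URL is not double-encoded.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def encode_subject_url(url: str) -> str:
    """Percent-encode reserved characters such as '+' in the URL path.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # unquote() first so an existing %2B is not turned into %252B,
    # then quote() everything except the path separator.
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


raw = "https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg"
print(encode_subject_url(raw))
```

Running it on a URL that is already encoded should return the same URL unchanged, which matters if other projects publish locations that are encoded up front.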

@eatyourgreens
Contributor

Have you got the URL for that subject on the old and new sites? I'm interested to see if the browser encodes the URL by default. If not, we can explicitly encode it when we parse subject locations. Assuming that doesn't break subject URLs for any other projects.

@camallen
Contributor Author

So this subject is one of them https://talk.galaxyzoo.org/subjects/AGZ000atp8/

From this collection https://talk.galaxyzoo.org/collections/CGZS0003tq/ the URL is encoded correctly

<img loading="lazy" alt="Subject AGZ000atp8" src="https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg">

however if we use the rewritten non-s3 URL www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg we get a redirect via the nginx proxy to an s3 URL which has the path decoded in the rewritten location, https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg

That's not great - it seems the issue here is the nginx static proxy and the rewrite rule. We may have to proxy pass these URLs (serve them directly) via NGINX instead of redirecting them to avoid this issue.

@camallen
Contributor Author

camallen commented Aug 20, 2020

This is getting more interesting....after testing a local version of the static nginx proxy, our static proxy seems to be preserving the encoded URLs correctly. Note the Location response header below

Local test of static proxy (only debug headers added)

$ curl -v -H "Host: www.galaxyzoo.org"  localhost:8080/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8080 (#0)
> GET /subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg HTTP/1.1
> Host: www.galaxyzoo.org
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 301 Moved Permanently
< Server: nginx/1.4.6 (Ubuntu)
< Date: Thu, 20 Aug 2020 21:46:19 GMT
< Content-Type: text/html
< Content-Length: 193
< Connection: keep-alive
< Location: https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
< X-debug-message: /subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg
< X-debug-message: /subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
< X-debug-message: https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
< 
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.4.6 (Ubuntu)</center>
</body>
</html>
* Connection #0 to host localhost left intact

This is the http to https redirect; note the encoding is preserved

$ curl -v www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
*   Trying 52.186.94.16...
* TCP_NODELAY set
* Connected to www.galaxyzoo.org (52.186.94.16) port 80 (#0)
> GET /subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg HTTP/1.1
> Host: www.galaxyzoo.org
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 308 Permanent Redirect
< Server: nginx/1.17.10
< Date: Thu, 20 Aug 2020 20:49:44 GMT
< Content-Type: text/html
< Content-Length: 172
< Connection: keep-alive
< Location: https://www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
< 
<html>
<head><title>308 Permanent Redirect</title></head>
<body>
<center><h1>308 Permanent Redirect</h1></center>
<hr><center>nginx/1.17.10</center>
</body>
</html>
* Connection #0 to host www.galaxyzoo.org left intact

However when we hit the https url we lose the encoding :(

$ curl -v https://www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
*   Trying 52.186.94.16...
* TCP_NODELAY set
* Connected to www.galaxyzoo.org (52.186.94.16) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
...TLS stuff removed
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fe5ee806600)
> GET /subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg HTTP/2
> Host: www.galaxyzoo.org
> User-Agent: curl/7.54.0
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 301 
< server: nginx/1.17.10
< date: Thu, 20 Aug 2020 20:57:36 GMT
< content-type: text/html
< content-length: 193
< location: https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg
< strict-transport-security: max-age=15724800; includeSubDomains
< 
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.4.6 (Ubuntu)</center>
</body>
</html>
* Connection #0 to host www.galaxyzoo.org left intact

Note the response Location header above is where we lose the encoding; the request from the client is still encoded.
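
The comparison made by eye in the curl output above can be expressed as a small check. This is only a sketch: `location_preserves_encoding` is a hypothetical helper name, and it assumes the redirect target keeps the request path as a suffix (as the s3 rewrite here does).

```python
from urllib.parse import urlsplit


def location_preserves_encoding(request_url: str, location: str) -> bool:
    """Return True if the Location path still ends with the raw request path.

    urlsplit() does not percent-decode, so '%2B' and '+' compare as different.
    """
    request_path = urlsplit(request_url).path
    location_path = urlsplit(location).path
    return location_path.endswith(request_path)


request = "https://www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg"
# Location header from the local static proxy test
local = "https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg"
# Location header from the production https request
production = "https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg"

print(location_preserves_encoding(request, local))
print(location_preserves_encoding(request, production))
```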

The nginx logs in k8s show the decoded URL; it appears that the nginx ingress is rewriting the URL before it hits the static proxy pod

10.244.14.97 www.galaxyzoo.org - [20/Aug/2020:20:57:36 +0000] "GET /subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg HTTP/1.1" 301 193 "-" "curl/7.54.0" -
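
A simplified model of the suspected behaviour, in Python. It is an assumption about what happens between the ingress and the static proxy, not a description of nginx internals: one hop percent-decodes the path, and the next hop builds the redirect from whatever path it receives. The function names and the `S3_PREFIX` constant are made up for illustration.

```python
from urllib.parse import unquote

# Redirect prefix as seen in the Location headers in this thread.
S3_PREFIX = "https://s3.amazonaws.com/www.galaxyzoo.org"


def ingress_forward(request_path: str) -> str:
    """Model of a hop that forwards the percent-decoded path upstream."""
    return unquote(request_path)


def static_proxy_redirect(received_path: str) -> str:
    """Model of the static proxy: prefix the received path unchanged."""
    return S3_PREFIX + received_path


client_path = "/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg"

# Direct hit (local test): the encoding survives.
print(static_proxy_redirect(client_path))
# Via the decoding hop (production): the literal '+' appears.
print(static_proxy_redirect(ingress_forward(client_path)))
```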

This looks relevant: kubernetes/ingress-nginx#1615 (comment)
Our nginx ingress config looks like this:

## start server *.galaxyzoo.org
	server {
		server_name *.galaxyzoo.org ;
		
		listen 80  ;
		listen 443  ssl http2 ;
		
		set $proxy_upstream_name "-";
		
		ssl_certificate_by_lua_block {
			certificate.call()
		}
		
--
			proxy_next_upstream_tries               3;
			
			rewrite "(?i)/" /$1 break;
			proxy_pass http://upstream_balancer;
			
			proxy_redirect                          off;
			
		}
		
	}
	## end server *.galaxyzoo.org

@eatyourgreens
Contributor

Looking at it on my phone, the image is broken in this discussion about that subject. I rebuilt the discussion pages this morning. https://talk.galaxyzoo.org/boards/BGZ0000004/discussions/DGZ0001krf/ So that would be the redirect breaking the location? The subject and collections pages are built from master, but the discussion page is built from #81.

@camallen
Contributor Author

resolved by zooniverse/static#176
