Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finalize sitemap #2492

Merged
merged 16 commits into from Sep 5, 2016
Merged

Finalize sitemap #2492

merged 16 commits into from Sep 5, 2016

Conversation

rap1ds
Copy link
Member

@rap1ds rap1ds commented Sep 1, 2016

Cherry-picked commits from #2303

@@ -379,6 +379,14 @@ default: &default_settings
#
font_proximanovasoft_url:

# Maximum number of links in sitemap.xml.
#
# The default number is 5000. Increasing the limit may affect on performance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's either "affect performance" or "have effect on performance"

@rap1ds rap1ds force-pushed the finalize-sitemap branch 2 times, most recently from aff9338 to 0ea4e1b Compare September 1, 2016 13:53
@@ -3,3 +3,4 @@
# Add new mime types for use in respond_to blocks:
# Mime::Type.register "text/richtext", :rtf
# Mime::Type.register_alias "text/html", :iphone
Mime::Type.register "application/x-gzip", :xml_gz, [], ["xml.gz"]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed anymore? The Rack::Deflater handles now the gzipping, and in the response we send:

Content-Encoding: gzip
Content-Type: application/xml

Full cURL:

➜  ~ curl 'http://test.lvh.me:3000/sitemap.xml' -H 'Pragma: no-cache' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8,fi;q=0.6,de;q=0.4,fr;q=0.2' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Cache-Control: no-cache' -H 'Cookie: _sharetribe_session_newnewnew=2194b4c4245929eb8ac04fb7ae4ebf50; _sharetribe_session_newnewnew4=814a4b5ea25b5777676c9b8c5724415f; _st_session=c9d6795069a8a65b41f936ff5c1db67e' -H 'Connection: keep-alive' -i
HTTP/1.1 200 OK
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Content-Disposition: attachment; filename="sitemap.xml"
Content-Transfer-Encoding: binary
Content-Type: application/xml
Cache-Control: private
Etag: W/"e0002c666520520bb6e5ded9b894e758"
X-Meta-Request-Version: 0.4.0
X-Request-Id: 35c92e0a-d63f-4b7f-82f6-541dcb6116ce
X-Runtime: 0.048266
Vary: Accept-Encoding
Content-Encoding: gzip
Server: WEBrick/1.3.1 (Ruby/2.3.1/2016-04-26)
Date: Thu, 01 Sep 2016 13:57:19 GMT
Content-Length: 467
Connection: Keep-Alive
Set-Cookie: _st_session=c9d6795069a8a65b41f936ff5c1db67e; path=/; expires=Sat, 01 Oct 2016 13:57:19 -0000; HttpOnly

So to my understanding we can remove this, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is that mime type applies to content being transferred and transfer encoding only applies to the transfer, i.e. the sitemap.xml is delivered over the wire as gzipped but the client delivers it to requesting app as xml. With application/x-gzip the content itself is gzipped so if you serve a sitemap.xml.gz file then that's appropriate mime type. I assume we only serve the xml and not a .gz file containing the xml? In that case application/xml is the correct mime type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for explanation!


if APP_CONFIG.asset_host.present?
redirect_to ActionController::Base.helpers.asset_url(
"/sitemap/generate.xml?sitemap_host=#{request.host}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to pass the host instead of pulling it in generate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it, we use different host for assets if asset_host is set.

@rap1ds rap1ds mentioned this pull request Sep 1, 2016
include ActionController::Rescue
include ActionController::Head
include ActionController::Redirecting

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With metal controller you want to add:

  # Ensure ActiveSupport::Notifications events are fired
  include ActionController::Instrumentation

This triggers the structured request logging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I checked my facts. Instrumentation alone is not enough. append_into_to_payload method also needs to be overridden. The version used in LandingPageController is:

  # Override basic instrumentation and provide additional info for
  # lograge to consume. These are pulled and logged in environment
  # configs.
  def append_info_to_payload(payload)
    super
    payload[:host] = request.host
    payload[:community_id] = community(request)&.id
    payload[:current_user_id] = nil
    payload[:request_uuid] = request.uuid
  end

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

Dan Moore and others added 9 commits September 5, 2016 15:02
Only display active listings for public communities.  Use the rails
caching to generate it once a day.

can now push sitemap from controller without intermediate file

added support for rails caching

not needed, all done at runtime

fixed issue where community id wasn't being passed appropriately, and added more tests

remove generate method because it isn't used anymore (no cdn support at this time).  also refactored out the call to find the listings

actually really call out to the abstraction

use nice url in sitemap

converted to metal controller and send responses back when a community is private or deleted

replace default host with simpler solution

moving to using currently open

limit to 50k urls

corrected name of method

make class method private

use id as a cache key so that we don't invalidate if the community object changes, just once a day

use methods rather than accessing members

fix for cache value--making it more generic than just the community id

compress the data before adding it to the cache

switched to using pluck, per request

use more elegant map method for organizing plucked data

fixing errors found by rubocop

fixing rubucop errors that slipped through last time
Only display active listings for public communities.  Use the rails
caching to generate it once a day.

can now push sitemap from controller without intermediate file

added support for rails caching

not needed, all done at runtime

fixed issue where community id wasn't being passed appropriately, and added more tests

remove generate method because it isn't used anymore (no cdn support at this time).  also refactored out the call to find the listings

actually really call out to the abstraction

use nice url in sitemap

converted to metal controller and send responses back when a community is private or deleted

replace default host with simpler solution

moving to using currently open

limit to 50k urls

corrected name of method

make class method private

use id as a cache key so that we don't invalidate if the community object changes, just once a day

use methods rather than accessing members

fix for cache value--making it more generic than just the community id

compress the data before adding it to the cache

switched to using pluck, per request

use more elegant map method for organizing plucked data

fixing errors found by rubocop

fixing rubucop errors that slipped through last time
- Content negotiation (send gzip only if client accepts gzip)
- Use Rack::Deflate for gzippping
- Add max_sitemap_limit to config_defaults.yml
- Controller specs don't work well with Metal controllers
This was originally added for sitemap.xml.gz. However, we changed it so that we don't serve sitemap.xml.gz, but only sitemap.xml and let the Rack::Deflate middle ware to handle the transfer encoding.
The format that the crawlers expect when then they request sitemap.xml is xml. It doesn't make sense to redirect the request to URL which returns HTML. It's better just to send 404 status code, which is informative enough for the crawler.
- Generates path without protocol, i.e. starts with //asset.host.com
@rap1ds rap1ds merged commit 8626e12 into master Sep 5, 2016
@rap1ds rap1ds deleted the finalize-sitemap branch September 5, 2016 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants