
Implement sitemap.xml and submit to google and friends for faster re-indexing after edits #2586

Closed · 8 of 11 tasks
brendanheywood opened this issue Feb 6, 2017 · 14 comments
Labels
1: MUST BE RESOLVED FOR RELEASE · 2: Bug fix · 5: Enhancement · Area: SEO OpenGraph OpenSearch

Comments


brendanheywood commented Feb 6, 2017

Been reading up a bit more on sitemaps and have come to the conclusion we'd benefit a lot from having these in place. The primary reason is that they give us a bulk way of alerting google and friends to what has recently been updated, so they can prioritize what to re-index. They also let us give a clear signal about which pages are more important than others.

https://www.sitemaps.org/protocol.html

Some thoughts so far

  • we have way too many urls to fit into one sitemap, so we will need a master index and sub sitemaps. Each sitemap can only contain 50k urls, so we need to figure out the best way to slice them up that is future proof as we grow. I was thinking about one per country or something like that, but that's not great. Possibly the simplest is just the whole index in nodeid order broken into pages (see the sketch after this list). I don't think there is any real benefit to having related pages grouped into the sub sitemaps.
    So the master would be a new template at /sitemap.xml
  • all sub sitemaps must be at the root level, as the protocol has assumptions about which urls a sitemap can 'own'.
    All the sub sitemaps would be /sitemap-1.xml, /sitemap-2.xml etc
  • each sitemap has both a url limit and a file size limit. Easiest is to just pick a max record count which satisfies both, so this might end up being 20k records instead of 50k; need to test and tune.
  • lastmod - pull the date out from the last stream edit, BUT do we know the last update to this page itself, not simply its descendants?
  • what urls to include? I think for the first cut we just include node urls. Crawlers will still reach the others from there, but these are the main ones where we care about faster re-indexing.
  • add the link to the robots.txt file
  • manually check google at least to make sure it's slurping it up
  • generating this will probably be expensive. Do we want it generated on the fly, or cached? Ideally avoid any caching unless we absolutely need to. Perhaps we can get around this by simply having a smaller page size, so each sitemap is fairly fast to generate but we have lots of them.
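
For concreteness, a minimal sketch in Python of that paginated master index; the page size and base url are assumptions to be tuned, while the <sitemapindex> element is the format defined by the sitemaps.org protocol linked above:

# Minimal sketch of the paginated master index. PAGE_SIZE and the base url
# are assumptions to be tuned, not settled values.
from math import ceil

PAGE_SIZE = 20_000  # below the 50k url cap so the file size limit also holds

def master_sitemap_xml(total_urls: int, base: str = "https://www.thecrag.com") -> str:
    pages = ceil(total_urls / PAGE_SIZE)
    entries = "\n".join(
        f"  <sitemap><loc>{base}/sitemap-{i}.xml</loc></sitemap>"
        for i in range(1, pages + 1)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries + "\n</sitemapindex>"
    )

The robots.txt hook from the list above is then just a single "Sitemap: https://www.thecrag.com/sitemap.xml" line.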

Optional but desirable (we can skip these for the MVP)

  • change frequency - we can probably do something with the stream data to derive this
  • priority - this should take into account how big and useful a crag is, eg we could use the crag credits algorithm, and possibly simpler heuristics like TLCs and regions being more important than nodes lower down.

Need to think about:

  • changed url stubs, reparents etc. Do we include recent reparents for, say, 1 month? Can we include both the old url and the canonical url?

We need a consistent way to say when a page was last modified. Eg if a field on the page was modified, then yes. If a child's name was, then yes, but not a child's content. An annotation, yes. A route, yes. A topo, yes. So we need to store that as a stat, keep it current, and use it to mark pages for a fresh scrape in the index.
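
As a strawman, that stat could start as a simple event-type table (event names here are illustrative, not an existing API):

# Strawman for the 'does this change bump the page lastmod?' stat.
# Event names are illustrative only, not an existing API.
BUMPS_LASTMOD = {
    "own_field_edited":     True,   # a field on this page was modified
    "child_renamed":        True,   # a child's name changed
    "child_content_edited": False,  # a child's own content does not count
    "annotation_added":     True,
    "route_added":          True,
    "topo_added":           True,
}

def bumps_lastmod(event_type: str) -> bool:
    # default to False so unknown event types don't trigger spurious re-scrapes
    return BUMPS_LASTMOD.get(event_type, False)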

@brendanheywood added the 5: Enhancement and Area: SEO OpenGraph OpenSearch labels Feb 6, 2017
@brendanheywood

Andrew hit an issue with the bot he is writing: it found a route that is not being indexed:

[screenshot of the unindexed route]

@brendanheywood added the 2: Bug fix label Jul 29, 2017
@andrewk1

Hey there, found another that doesn't index properly:

https://www.thecrag.com/climbing/united-states/central-cascade-mountains/route/19183189

@brendanheywood

@andrewk1 did you want these fixed, or are you happy to leave them as test cases?


andrewk1 commented Jul 30, 2017 via email


scd commented Jul 30, 2017

I was thinking about how to split the sitemap. Does there have to be any logical order in the urls? If not, then could we hash the url and use the resulting hash to put it in one of, say, 16 buckets?

I think we might know when a node is directly updated as opposed to a descendant. I am pretty sure I split the last updated field into two for this purpose.

@brendanheywood

@scd no, order doesn't matter. I don't think we need to hash anything; instead we can base it completely off the lastmod. After reading a couple of blog posts about sitemaps for very large sites, I'm thinking we should paginate into fixed time periods. There are two use cases we need to support with the sitemap: the first is the one-off big index to ensure everything is indexed, and the second is the incremental ongoing re-index. The latter can be implemented using either sitemaps or atom feeds, but either way the logic and output are very similar, so I think we should stick with just sitemaps and get them solid, and only worry about atom if we really need to.

So the master sitemap would be something vaguely like:

sitemap.xml
 - sitemap-201707.xml (anything updated so far this month)
 - sitemap-201706.xml (anything updated in june and not since)
 - sitemap-201705.xml (anything updated in may and not since)
 - .....
 - sitemap-201701.xml (anything updated in jan and not since)
 - sitemap-older.xml (anything updated older than jan)

So the idea is that when a robot grabs the latest master sitemap, it knows that its last crawl date was, say, a week ago, and so it knows that the only sub sitemap it needs is the latest one and it can ignore the rest. The size of the fixed time periods is arbitrary and we can tune it to ensure that no sub sitemap exceeds 50k records. A month is probably the right size here, but we need to run some numbers to validate this. If a page is updated in March, and then in June, it no longer appears in the March sitemap.

My general thinking is that most of the time the robot will only ever hit the master and then the latest sub sitemap, so the load will be very low and we can probably make this a live db query rather than pregenerating these files. Generating the master file should not need anything other than today's date and the hard-coded 'oldest' date; it should not require a db query at all.
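
A sketch of that master generation in Python, assuming a hard-coded oldest month (the OLDEST value is illustrative):

# The master index needs only today's date plus a hard-coded 'oldest'
# month - no db query at all. OLDEST below is an assumed value.
from datetime import date

OLDEST = (2017, 1)  # (year, month) of the oldest monthly bucket

def bucket_names(today: date) -> list[str]:
    names, y, m = [], today.year, today.month
    while (y, m) >= OLDEST:
        names.append(f"sitemap-{y}{m:02d}.xml")
        y, m = (y, m - 1) if m > 1 else (y - 1, 12)
    names.append("sitemap-older.xml")
    return names

# bucket_names(date(2017, 7, 31)) ->
#   ['sitemap-201707.xml', ..., 'sitemap-201701.xml', 'sitemap-older.xml']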

An example fairly close to what we want is here:

http://www.realestate.com.au/news/sitemaps/sitemap_index.xml

Each sub sitemap will return results in lastmod order, purely to make it easier to validate and test things manually and confirm that they are getting updated when they should be.
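
Per monthly bucket that could be something like the following (table and column names are assumptions); because lastmod is a single value per node, a later edit automatically moves a row out of its old month's bucket and into the newer one:

# Hypothetical query for one monthly sub sitemap; table and column names
# are assumptions about the schema, not the real one.
MONTHLY_SITEMAP_QUERY = """
    SELECT canonical_url, lastmod
      FROM node
     WHERE lastmod >= :month_start
       AND lastmod <  :next_month_start
     ORDER BY lastmod  -- makes manual spot-checking easy
"""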

We also need to clarify the business rules for when a node's lastmod timestamp gets bumped. It is not as simple as 'this node's lastmod' vs 'max of any descendant node's lastmod'; it should be much closer to 'max of this node's lastmod and its direct child nodes' lastmod'. This feels to me like it should be its own statistic, as I can imagine the rules will get hairy for some object types and we definitely don't want that logic coupled to the sitemap logic.
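
One reading of that rule as its own statistic (attribute names are assumptions about the schema):

# Max of this node's own lastmod and its *direct* children's lastmod only -
# not the whole descendant subtree. Attribute names are assumed.
from datetime import datetime

def sitemap_lastmod(node) -> datetime:
    return max(
        [node.last_updated] +
        [child.last_updated for child in node.direct_children]
    )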

@brendanheywood added this to the Release 57 - New redesign + homepage milestone Jul 31, 2017
@brendanheywood

fyi google has just confirmed that the first rescrape using the new webmasters tool account is ready to go, so I think we can go ahead and implement this whenever we are ready.


scd commented Sep 10, 2017

@brendanheywood I have started on this.

A couple of questions.

What index urls do we want to include? Just the base index url, or forums as well? Any others?

Do we want to also include users?

I presume we also want our articles. In all languages? We don't know the date these articles were last updated, as this depends on translation updates. Can we just assume that they always need re-indexing at the beginning of each month?


scd commented Sep 10, 2017

Thinking about this a little more, we should have every url we want indexed in the sitemap files.

Every now and again we will do a page upgrade which applies to all urls of that category (eg routes). When this happens we want all routes to be scheduled to be re-indexed. This process is exactly the same as the initial sitemap index.

So in the sitemap generator I will have a regexp over urls that carries an override date for the last update (see the sketch below). This would work for our help articles by just setting the last update date manually for all of them. I don't think it is worthwhile having any smarts to work out when translations were last updated, for example.
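
A sketch of that override table (the pattern and date below are illustrative, not the real config):

# Override table: (url regexp, forced last-update date). If a url matches,
# the newer of the forced date and the node's own edit date wins.
import re
from datetime import date

OVERRIDES = [
    (re.compile(r"^/article/"), date(2017, 9, 1)),  # eg bump every help article
]

def effective_lastmod(url: str, node_lastmod: date) -> date:
    for pattern, forced in OVERRIDES:
        if pattern.search(url):
            return max(forced, node_lastmod)
    return node_lastmod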

I like your suggested format of creating time-based sitemaps. Monthly seems right. It seems ok just to do the last 12 months.

Over time we will refine the rules for the lastUpdated date, but it will most likely just end up being the node (or account) record's last updated field. I think it is reasonable that the last updated date for the app is the same as for the sitemap.

Forums have a particular issue: we should only submit forum urls where there is a direct post, and use the last comment date.

@brendanheywood if you get any time, can you review and extend this list of all the url categories we need in the sitemap:

  • routes
  • areas
  • forums
  • accounts
  • help articles
  • home page

All this makes me think that dynamically creating this is not so important. I will just cron this daily.

Note that I am using the same function as the app does for getting updated index info, which makes me think we are on the right track. I have had to add the canonical url to the list of fields returned, because it has not been required up until now. I imagine in the fullness of time the app will require the node url as well.


brendanheywood commented Sep 11, 2017

What index urls do we want to include. Just the base index url or forums as well? any others?

Google is already finding lots of other edge case urls which we are now trimming back. I think the list above is a solid start, so let's give that a few months to settle and then we can patch up anything else we missed. Except forums vs discussions, see below.

Do we want to also include users?

I presume we also want our articles.

Yes

In all languages?

Yes, but probably deferred - we need one unique url per language per article before we push these into the sitemap. We could also do multi-lang with one url, but it's messier, and we need the multiple urls for the multi-lang content anyway, so it would be better to be consistent across the site.

We don't know the date these articles were last updated as this depends on translation updates. Can we just assume that they always need re indexing at the beginning of each month?

We will shortly - we need this same metadata for the 'what's changed recently in english that needs to be reviewed in other langs' feature. In the interim, one month is fine.

Every now and again we will do a page upgrade which applies to all urls of that category (eg routes). When this happens we want all routes to be scheduled to be re-indexed. This process is exactly the same as the initial sitemap index.

Perfect, I had the same thought. However we should be quite careful with this: like everything in the sitemap it's just another signal, so if robots find that stuff hasn't actually changed when we say it has, they could start paying less attention to our lastmod dates.

Forums has a particular issue. We should only submit forum urls where there is a direct post, and use the last comment date.

  • I'm not sure we want to index the forum pages at all; they are vaguely like facet pages, and I don't think we should ever index any facet pages either. We definitely do want to index the thread pages themselves, so I'd start with just those.


scd commented Oct 24, 2017

This issue has become too big and unwieldy for me to know if there is anything else to do. The main stuff is done, but there are a lot of edge case comments that are not fully complete. I think we should close this and create other issues.

At the moment it looks like the following are still outstanding:

  • list of example routes not being indexed by google. We just have to wait and keep checking from time to time
  • frequency. At the moment the sitemap is scheduled to rebuild monthly.
  • adjust indexing priorities. This is fine tuning that we will not even think about for months.
  • changing canonical urls. This just comes out in the wash when google re-indexes the page.

Looking at all these I am happy just to close this issue, and create new ones if we felt that something needed attention.

@brendanheywood

  1. I am monitoring this so nothing to do
  2. monthly is fine for now
  3. priorities are ignored
  4. see trello card

So yes, let's close this.
