Implement sitemap.xml and submit to google and friends for faster re-indexing after edits #2586
Hey there, found another that doesn't index properly: https://www.thecrag.com/climbing/united-states/central-cascade-mountains/route/19183189
@andrewk1 did you want these fixed, or are you happy to leave them as test cases?
It's fine to leave these as examples
I was thinking about how to split the sitemap. Does there have to be any logical order in the urls? If not, then could we hash the url and use the resulting hash to put it in one of, say, 16 buckets? I think we might know when a node is directly updated as opposed to a descendant. I am pretty sure I split the last updated field into two for this purpose.
@scd no, order doesn't matter. I don't think we need to hash anything; instead we can base it completely off the lastmod. After reading a couple of blog posts about sitemaps for very large sites, I'm thinking we should paginate into fixed time periods. There are two use cases we need to support with the sitemap: the first is the one-off big index to ensure everything is indexed, and the second is the incremental ongoing re-index. The latter can be implemented using either sitemaps or atom feeds, but either way the logic and output is very similar, so I think we should stick with just sitemaps and get that solid, and only worry about atom if we really need to. So the master sitemap would be something vaguely like:
So the idea is that when a robot grabs the latest master sitemap, it knows that its last crawl date was, say, a week ago, and so it knows that the only sub sitemap it needs is the -latest one and it can ignore the rest. The size of the fixed time periods is arbitrary and we can tune it to ensure that no page is > 50k records. A month is probably the right size here, but we need to run some numbers to validate this. If a page is updated in March, and then again in June, it no longer appears in the March sitemap.

My general thinking is that most of the time the robot will only ever hit the master and then the latest, so the load will be very low and we can probably make this a live db query and not pregenerate these files. Generating the master file should not need anything other than today's date and the hard-coded 'oldest' date; it should not require a db query at all. An example fairly close to what we want is here: http://www.realestate.com.au/news/sitemaps/sitemap_index.xml

Each sub sitemap will return results in lastmod order, purely to make it easier to validate and test manually and confirm that things are getting updated when they should be.

We also need to clarify the business rules for when a node's lastmod timestamp gets bumped. It is not as simple as 'this node's lastmod' vs 'max of any descendant node's lastmod'; it should be much closer to 'max of this node's lastmod and its direct child nodes' lastmod'. This feels to me like it should be its own statistic, as I can imagine the rules will get hairy for some object types and we definitely don't want that logic coupled to the sitemap logic.
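To illustrate the claim that the master file needs only today's date and a hard-coded oldest date (no db query), here is a minimal sketch. The base URL, the oldest date, and the `sitemap-YYYY-MM.xml` / `sitemap-latest.xml` naming are assumptions for illustration:

```python
# Sketch: generate the master sitemap index from only today's date and a
# hard-coded 'oldest' date. File naming and dates are illustrative assumptions.
from datetime import date

OLDEST = date(2010, 1, 1)          # assumed hard-coded 'oldest' date
BASE = "https://www.thecrag.com"   # assumed base url

def month_buckets(today):
    """Yield (year, month) for every whole month from OLDEST up to last month."""
    y, m = OLDEST.year, OLDEST.month
    while (y, m) < (today.year, today.month):
        yield y, m
        m += 1
        if m > 12:
            y, m = y + 1, 1

def master_sitemap(today):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for y, m in month_buckets(today):
        lines.append('  <sitemap><loc>%s/sitemap-%04d-%02d.xml</loc></sitemap>'
                     % (BASE, y, m))
    # the still-open current period: robots that crawled recently only need this
    lines.append('  <sitemap><loc>%s/sitemap-latest.xml</loc>'
                 '<lastmod>%s</lastmod></sitemap>' % (BASE, today.isoformat()))
    lines.append('</sitemapindex>')
    return "\n".join(lines)

print(master_sitemap(date(2017, 9, 15)))
```

The whole thing is a pure function of the date, which is what makes serving it live (or cronning it) equally cheap.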
FYI, Google has just confirmed that the first rescrape using the new Webmaster Tools account is ready to go, so I think we can go ahead and implement this whenever we are ready.
Sitemap index file format: https://support.google.com/webmasters/answer/75712?visit_id=1-636406182090731367-1077418191&rd=1
@brendanheywood I have started on this. A couple of questions. Which index urls do we want to include: just the base index url, or forums as well? Any others? Do we want to also include users? I presume we also want our articles. In all languages? We don't know the date these articles were last updated as this depends on translation updates. Can we just assume that they always need re-indexing at the beginning of each month?
Thinking about this a little more, we should have every url we want indexed in the sitemap files. Every now and again we will do a page upgrade which applies to all urls of that category (eg routes). When this happens we want all routes to be scheduled for re-indexing. This process is exactly the same as the initial sitemap index. So in the sitemap generator I will have a regexp on urls that carries an override date for last update. This would work for our help articles by just setting the last update date manually for all help articles. I don't think it is worthwhile having any smarts to work out when translations were last updated, for example.

I like your suggested format of creating time-based sitemaps. Monthly seems right, and it seems ok to just do the last 12 months. Over time we will refine the rules for the lastUpdated date, but it will most likely just end up being the node (or account) record's last updated field. I think it is reasonable that the last updated date for the app is the same as for the sitemap.

Forums have a particular issue. We should only submit forum urls where there is a direct post, and use the last comment date. @brendanheywood if you get any time can you review and extend the list of all the url categories we need in the sitemap?
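The regexp-with-override-date idea above could be sketched like this. The patterns, dates, and function name here are hypothetical, just to show the shape of the rule table:

```python
# Sketch of per-category lastmod overrides: a page upgrade that touches a
# whole url category forces those urls into the current sitemap bucket.
# Patterns and dates below are illustrative assumptions.
import re
from datetime import date

OVERRIDES = [
    (re.compile(r'/route/'),   date(2017, 8, 1)),  # e.g. a routes page upgrade
    (re.compile(r'/article/'), date(2017, 9, 1)),  # e.g. help articles, set manually
]

def effective_lastmod(url, record_lastmod):
    """Max of the record's own lastmod and any matching category override."""
    best = record_lastmod
    for pattern, override in OVERRIDES:
        if pattern.search(url) and override > best:
            best = override
    return best
```

Because the override only raises the date, urls edited after the upgrade keep their own, newer lastmod.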
All this makes me think that dynamically creating this is not so important. I will just cron this daily. Note that I am using the same function as the app does for getting updated index info, which makes me think we are on the right track. I have had to add the canonical url to the list of fields returned because it has not been required up until now. I imagine in the fullness of time the app will require the node url as well.
Google is already finding lots of other edge case urls which we are now trimming back. I think the list above is a solid start, so let's give that a few months to settle and then we can patch up anything else we missed. Except forums vs discussions, see below.
yes
Yes, but probably defer - we need 1 unique url per language per article before we push this into the sitemap. We can also do multi lang with 1 url, but it's messier, and we need the multi urls for the multi lang content anyway, so it would be better to be consistent across the site.
We will shortly - we need this same metadata to do the 'what's changed recently in English that needs to be reviewed in other langs' feature. In the interim 1 month is fine.
Perfect, I had the same thought. However we should be quite careful with this: like everything in the sitemap it's just another signal, so if robots think stuff hasn't changed when we say it has, then they could pay less attention to it.
This issue has become too big and unwieldy for me to know if there is anything else to do. The main stuff is done but there are a lot of edge case comments that are not fully complete. I think we should close this and create other issues. At the moment it looks like the following are still outstanding:
Looking at all these I am happy just to close this issue, and create new ones if we feel that something needs attention.
So yes, let's close this.
I've been reading up a bit more on sitemaps and have come to the conclusion we'd benefit a lot from having these in place. The primary reason is that they give us a bulk way of alerting google and friends to what has recently been updated, so they can prioritize what to re-index. It also allows us to give a clear signal about which pages are more important than others.
https://www.sitemaps.org/protocol.html
Some thoughts so far
So the master would be a new template at /sitemap.xml
All the sub sitemaps would be /sitemap-1.xml /sitemap-2.xml etc
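For reference, a sub sitemap per the sitemaps.org protocol is just a `<urlset>` of `<loc>`/`<lastmod>` pairs, capped at 50,000 entries per file. The date below is an illustrative assumption; the url is one from this thread:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.thecrag.com/climbing/united-states/central-cascade-mountains/route/19183189</loc>
    <lastmod>2017-09-01</lastmod>
  </url>
  <!-- up to 50,000 <url> entries per file, per the protocol -->
</urlset>
```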
Optional but desirable (we can skip in the MVP)
Need to think about:
Need a consistent way to say when a page was last modified. Eg if a field was modified, then yes. If a child's name was, then yes, but not a child's content. An annotation, yes. A route, yes. A topo, yes. So we need to store that as a stat and keep it current, and use it to mark pages for a fresh scrape in the index.
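The rule sketched above ("own lastmod or a direct child's lastmod, but not a deeper descendant's content change") could look like this. `Node` is a minimal stand-in for the real node objects, an assumed shape for illustration:

```python
# Sketch of the proposed 'page lastmod' statistic: a page's effective lastmod
# is the max of its own record lastmod and its direct children's lastmods.
# Deeper descendants do not bump it. Node is a hypothetical stand-in class.
from datetime import date

class Node:
    def __init__(self, lastmod, children=()):
        self.lastmod = lastmod
        self.children = list(children)

def page_lastmod(node):
    """Max of this node's lastmod and its direct children's record lastmods."""
    return max([node.lastmod] + [c.lastmod for c in node.children])
```

Keeping this as its own precomputed stat (rather than walking children at sitemap time) is what decouples the bump rules from the sitemap logic.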