Indexing: Add an incremental reindex (only indexing new items) #34
I wonder if polling their RSS feed would be a solution? I realize not all sites offer RSS feeds, but it seems the majority do, and for those that have one you could poll the RSS feed instead of crawling. It should also solve the issue of knowing what's been updated or not, since RSS takes care of that. Ray
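As a rough illustration of the polling idea (feedparser is what the project later adopted for feed parsing; the feed URL and stored cutoff date below are placeholders):

```python
# Sketch only: poll a feed and pick out entries newer than the last index.
from datetime import datetime, timezone

import feedparser

FEED_URL = "https://example.com/feed.xml"                  # placeholder
last_indexed = datetime(2021, 1, 1, tzinfo=timezone.utc)   # assumed stored value

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    published = entry.get("updated_parsed") or entry.get("published_parsed")
    if published:
        published_dt = datetime(*published[:6], tzinfo=timezone.utc)
        if published_dt > last_indexed:
            print("New or updated item to index:", entry.link)
```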
Yes, looking for changes in the RSS feed and/or sitemap sounds like a good idea. I'm using Scrapy's generic CrawlSpider, and could continue using that for the less frequent "reindex everything regardless". However, there is a SitemapSpider class, and it can be extended to only yield the sitemap entries that should be indexed (via a user-defined sitemap_filter method), so that could potentially do a comparison of previous item dates. It would need a bit of a rethink of the whole approach to indexing though, so not trivial.
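For example, a minimal sketch (with `last_indexed` standing in for whatever per-page date lookup the indexer would provide, and michael-lewis.com used only because it appears later in this thread):

```python
# Sketch only: a SitemapSpider that yields just the entries modified since
# the last index, via the sitemap_filter hook.
from datetime import date

from scrapy.spiders import SitemapSpider


class IncrementalSitemapSpider(SitemapSpider):
    name = "incremental_sitemap"
    sitemap_urls = ["https://michael-lewis.com/sitemap.xml"]

    # Hypothetical lookup: url -> date of the copy already in the index
    last_indexed = {"https://michael-lewis.com/posts/example/": date(2021, 1, 1)}

    def sitemap_filter(self, entries):
        for entry in entries:
            lastmod = entry.get("lastmod")  # e.g. "2021-05-01" or a full ISO timestamp
            previously_seen = self.last_indexed.get(entry["loc"])
            if not lastmod or not previously_seen:
                yield entry  # no date to compare against, so reindex to be safe
            elif date.fromisoformat(lastmod[:10]) > previously_seen:
                yield entry  # modified since the last index

    def parse(self, response):
        yield {"url": response.url}  # indexing logic would go here
```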
I've renamed this to reflect the focus on the slightly simpler aim of determining whether pages within a site should be reindexed, rather than the broader question of whether the site itself should be reindexed. I've also split off the "Crawl from sitemap and/or RSS" thread to a separate issue #54. Two options to improve the efficiency of the current spider:
All the info required for indexing a site should be passed into the CrawlerRunner at the start of indexing for a site, so that no further database or Solr lookups are required during indexing of a site. At the moment, all the config for indexing all sites is passed into the CrawlerRunner via common_config, and all the config for indexing a site is passed into the CrawlerRunner via site_config, but there is nothing passed in for page-level config. Suggest a one-off lookup to get a list of URLs with page_last_modified and/or etag values, and pass that into the CrawlerRunner via site_config or maybe even a new page_config. The Solr query to get all pages on the michael-lewis.com domain with a page_last_modified set is:
/solr/content/select?q=*:*&fq=domain:michael-lewis.com&fq=page_last_modified:*
A check with the scrapy shell:
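Something along these lines, run inside scrapy shell (the URL and date are placeholders, not the original test values):

```python
# Run inside `scrapy shell`; fetch() is a shell helper.
import scrapy

req = scrapy.Request(
    "https://michael-lewis.com/",
    headers={"If-Modified-Since": "Sat, 01 May 2021 00:00:00 GMT"},
)
fetch(req)
response.status  # expected to be 304 if the page is unchanged since that date
```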
Suggests a 304 response is returned with no content. Within the SearchMySiteSpider, adding the following:
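(The exact change isn't reproduced in this thread; a minimal sketch of that kind of request, with `page_last_modified` standing in for the per-page dates passed in via the site config, would be:)

```python
import scrapy

# Sketch only: set If-Modified-Since from the stored last-modified value.
# `page_last_modified` is an assumed dict of url -> HTTP-date string.
def conditional_request(url, page_last_modified, callback):
    headers = {}
    if url in page_last_modified:
        headers["If-Modified-Since"] = page_last_modified[url]
    return scrapy.Request(url, headers=headers, callback=callback)
```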
gets output indicating that the 304 response will be skipped. The issue is that this means links on that page are not spidered either, so if the home page is not indexed then nothing else will be either. At present no internal links are recorded within the Solr entry, so they can't be pulled from there. A workaround could be to not set the If-Modified-Since header where is_home=true, and of course if combined with #54 the impact would be minimised. The issue then would be that you would have to keep track of the skipped pages, so they wouldn't be removed when the existing documents are deleted and the newly crawled pages inserted at the end of indexing. All of which brings us back to the idea of doing a not completely robust "incremental" reindex such as that described here more often, and the current robust "full" reindex less often, which would be a bigger change. BTW, Scrapy does have a HttpCacheMiddleware with an RFC2616 policy which could potentially simplify this implementation. However, by default it would cache all the requests on the file system, which could lead to a significant increase in disk space requirements, and hence cost. You can implement your own storage backend, so in theory you might be able to use Solr, although I suspect the format the data is stored in would be different.
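For reference, enabling that option is mostly a settings change, something like the following standard Scrapy settings (the filesystem backend shown is the default, and is what drives the disk-space concern):

```python
# Scrapy settings sketch for the HttpCacheMiddleware option discussed above.
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_DIR = "httpcache"
# A custom backend (e.g. Solr-based) would replace HTTPCACHE_STORAGE with
# your own class implementing store_response()/retrieve_response().
```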
I'm learning a lot from following your thoughts on this. I like the idea of a Solr query before starting the crawl to gather all pages on the site with Last-Modified (or Etag) values. I'm thinking a Scrapy custom storage backend would deal with only those values and would not store the page content at all. I didn't get the impression the custom backend imposes a data format; it just needs to implement the storage interface. You pointed out that'll make Scrapy not crawl links in pages that respond 304. I think that's OK: the Solr query could return all pages, not just those with cache headers. As the Scrapy crawl progresses, just remove each visited page from the list. Then invoke Scrapy again on one of the pages that was not visited. Repeat until the list is empty. Alas, for a site like mine that responds 304 to most pages, that'd mean you must invoke Scrapy many times for each crawl. That'll introduce some overhead. (Unless you can submit a list of URLs to the initial Scrapy invocation?) I assume your code removes pages from the Solr index if they respond 404. Add code to leave the page unchanged in the Solr index on a 304 response. Sorry to be talking so much without having absorbed all of the code. I may be off base as a result.
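A rough sketch of the crawl-then-recrawl loop proposed above (all names here are hypothetical placeholders, not project code):

```python
# Hypothetical sketch: crawl, remove visited pages from the Solr-derived list,
# then crawl again from any page not yet visited, until the list is empty.
def crawl_until_all_pages_visited(start_url, all_known_pages, run_crawl):
    # `all_known_pages` would come from a Solr query for the site's existing pages.
    # `run_crawl(url)` is a placeholder that runs a Scrapy crawl starting at `url`
    # and returns the set of URLs it actually visited.
    remaining = set(all_known_pages)
    next_start = start_url
    while True:
        visited = run_crawl(next_start)
        remaining -= visited
        if not remaining:
            break
        next_start = remaining.pop()  # crawl again from a page not yet visited
```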
I'm currently letting Scrapy do the web crawling work, i.e. recursively requesting pages and extracting links, keeping track of what it has already requested and what it still has to do, handling deduplication, etc. Each crawl starts afresh, resulting in a relatively clean list of all the unique pages it has found (e.g. 404s aren't indexed). I then simply delete all the existing pages in the search index and insert the new ones to ensure the index is clean and up-to-date with no stale pages left behind. This keeps it fairly simple. The risk with a new incremental index is that it could end up requiring a lot of custom web crawling code, e.g. to keep track of what was indexed this time, what was indexed last time, what needs to be added to the index, what needs to be removed from the index, etc., and even then with all the extra work it may result in a less clean index. It could still be useful doing this extra work though, maybe not to improve the efficiency of the indexing as much, but more to check for added and updated pages more frequently than it does at the moment. For example, the more regular incremental index could work off its own list of pages to index rather than recursively spidering, to:
Then the less regular full reindex could:
A middle ground may be to investigate using the HttpCacheMiddleware with RFC2616 policy - that might be simpler to implement and good enough.
Some existing solutions: Google supports WebSub, which is also a critical part of the IndieWeb. This way, updates can be pushed to Search My Site in real-time and no polling is needed. Bing and Yandex use IndexNow. You don't have to participate in the IndexNow initiative but you could support the same API so people could re-use existing tools. Finally, there's the option of polling Atom, RSS, and h-feeds. This should probably not be the default behavior, but something that authenticated users could opt in to.
@Seirdy, many thanks for your suggestions. I would like to investigate WebSub and IndexNow further, but it isn't a priority because many of the smaller static sites (which are the main audience) don't support them, and there isn't a pressing need for the latest up-to-the-minute content. For now, my plan is to have two sorts of indexing: (i) the current periodic full reindex (spidering the whole site to ensure all content is indexed, that moved/deleted pages are removed, etc.), and (ii) a new, much more frequent incremental index (just requesting the home page, RSS, and sitemap and indexing the new/updated links it finds on those). I've got as far as identifying the RSS feeds and sitemaps (see also #54), but don't think I'll get a chance to fully implement the incremental reindex for another month or two.
…new field to indicate if a page is included in a web feed, and added support for #34 Add an incremental reindex
I've written and deployed the code to perform an incremental index, but have not yet had a chance to write the code to trigger the incremental index from the scheduler, so it is there but not in use. Both types of indexing use the same SearchMySiteSpider with the indexing type passed in via the site config, and it has been implemented with out-of-the-box Scrapy extensions (with the exception of the actual feed parsing, which uses feedparser), to keep things relatively simple. In summary: Full reindex:
Incremental reindex:
When the scheduling is implemented, the plan is for:
This should make the list of pages which are in web feeds reasonably up-to-date, which will be useful for some of the functionality spun off from #71.
…_indexing_status to indexing_status, full_indexing_status_changed to indexing_status_changed, part_indexing_status to last_index_completed, part_indexing_status_changed to last_full_index_completed, and default_part_reindex_frequency to default_incremental_reindex_frequency, for #34
…sic and free, and from 1 day to 3.5 days for full, for #34.
…ated the code to detect if it is a web feed to use data identified while parsing the item, for #54.
This has now been fully implemented. The full reindexing frequency and incremental reindexing frequency are shown at https://searchmysite.net/admin/add/. Once it has settled down and the impact of the more frequent indexing is clear, the hope is to move incremental reindexing for basic and free trial listings to every 7 days, and for full listings to every day. It is implemented via new database fields which store both the last_index_completed time and the last_full_index_completed time, and the indexing type (full or incremental) is determined with a query
where full_index is TRUE for a full reindex and FALSE for an incremental reindex (if both would be triggered, the full index is picked first). Note that the current indexing_page_limit may look like it is limiting the usefulness of the incremental reindex for basic listings, because most basic listings have hit the indexing_page_limit and so will not perform an incremental reindex. Even if each incremental reindex were allowed to go slightly over the limit, that is likely just to add some random pages rather than the newly added pages, because the pages already in the index probably won't be the most recent (the spider does not find links in a particular order, e.g. newest links first, and sometimes even detects links in a different order between runs, because there are many threads running concurrently).
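A sketch of the selection logic described above (not the actual SQL or code used; parameter names are assumptions):

```python
from datetime import datetime, timedelta

# Sketch only: a full reindex is due when last_full_index_completed is older
# than the full frequency, an incremental one when last_index_completed is
# older than the incremental frequency, and a full reindex wins if both are due.
def select_index_type(last_index_completed, last_full_index_completed,
                      incremental_reindex_frequency: timedelta,
                      full_reindex_frequency: timedelta,
                      now=None):
    now = now or datetime.now()
    full_due = now - last_full_index_completed >= full_reindex_frequency
    incremental_due = now - last_index_completed >= incremental_reindex_frequency
    if full_due:
        return "full"         # full_index = TRUE
    if incremental_due:
        return "incremental"  # full_index = FALSE
    return None               # nothing due yet
```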
As per the post "searchmysite.net: The delicate matter of the bill", one of the less desirable "features" of the searchmysite.net model is that it burns up a lot of money indexing sites on a regular basis even if no-one is actually using the system. It would therefore be good to try to reduce indexing costs.
One idea is to only reindex sites and/or pages which have been updated. It doesn't look like there is a reliable way of doing this though, e.g. given that only around 45% of pages in the system currently return a Last-Modified header, so there may need to be some "good enough" only-if-probably-modified approach.
For the only-if-probably-modified approach, one idea may be to store the entire home page in Solr, and at the start of reindexing a site compare the last home page with the new home page: if they are different, proceed with reindexing that site, and if they are the same, do not reindex it. There are some issues with this. For example, if the page has some auto-generated text which changes on each page load, e.g. a timestamp, it will always register as different even when nothing meaningful has changed; and conversely there may be pages within the site which have been updated even if the home page hasn't changed at all. It might therefore be safest to have, e.g., a weekly only-if-probably-modified reindex and a monthly reindex-everything-regardless (i.e. the current) approach as a fail-safe.
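A sketch of that comparison (the stored copy and function name are assumptions; the project's actual indexing pipeline is not shown here):

```python
import requests

# Sketch of the only-if-probably-modified check: fetch the home page and
# compare it with the copy stored in Solr during the last index.
def home_page_probably_modified(home_url, stored_home_page):
    response = requests.get(home_url, timeout=30)
    return response.text != stored_home_page
```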