
Indexing: Index wikipedia #10

Closed
m-i-l opened this issue Dec 5, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@m-i-l
Contributor

m-i-l commented Dec 5, 2020

A user submitted https://en.wikipedia.org/ via Quick Add. As per my rejection note (which you can see by trying to resubmit) I would love to index wikipedia, but it would require custom dev and likely an infra upgrade.

The big advantage of including wikipedia would be that it would turn searchmysite.net from a niche search into a more general search, and therefore give the site more "stickiness". It wouldn't be a departure from the original philosophy, which is (among other things) to index just the "good stuff", to penalise pages with adverts, and to focus on personal and independent websites at first (I think wikipedia still falls under the category of "independent website").

However, given the 6M+ English pages and 20M+ pages in other languages, spidering it via the normal approach would not be a good idea. Indeed the page at https://en.wikipedia.org/wiki/Wikipedia:Database_download even says "Please do not use a web crawler to download large numbers of articles." A better idea would be to periodically download the database dump and have a custom indexer for it. tblIndexedDomains could have a column added for indexer type. It may require some Solr schema changes too in order to get the most out of it. Wikipedia would have to be listed as not owner verified, and of course an exception made to the 50-page limit for non-owner-verified sites.
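
For illustration, a rough sketch of what a dump-based indexer loop might look like, assuming the standard pages-articles XML export from dumps.wikimedia.org (the file name and the final submit_to_solr() step are placeholders, not existing code):

```python
# Stream pages out of a compressed MediaWiki XML export rather than spidering.
import bz2
import xml.etree.ElementTree as ET

DUMP_FILE = "enwiki-latest-pages-articles.xml.bz2"  # hypothetical local path

def iter_articles(dump_path):
    """Yield (title, wikitext) pairs from the export, one page at a time."""
    with bz2.open(dump_path, "rb") as f:
        context = ET.iterparse(f, events=("start", "end"))
        _, root = next(context)                # the <mediawiki> root element
        title, text = None, None
        for event, elem in context:
            if event != "end":
                continue
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the export namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                root.clear()                   # keep memory flat across 6M+ pages

# for title, text in iter_articles(DUMP_FILE):
#     submit_to_solr(title, text)              # hypothetical custom indexer step
```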

Not a trivial undertaking, and it would almost certainly require a CPU, memory, and disk upgrade for the production server, i.e. increase running costs. But not completely out of the question either.

@m-i-l m-i-l added the enhancement New feature or request label Dec 5, 2020
@m-i-l m-i-l changed the title Indexing: index wikipedia Indexing: Index wikipedia Dec 14, 2020
@ScootRay

It may just be me but I'm not sure it's the best thing to do. Anyone can go to Wikipedia to dig up stuff, so it would seem redundant. My biggest concern is having to wade through wiki material when I don't want to in the first place. It's very hard to find highly relevant and highly focused search engines that cover specific areas, so Wikipedia may dilute that value.

Just my humble thoughts : )

Ray

@m-i-l
Contributor Author

m-i-l commented Apr 14, 2021

Thanks for your feedback.

Search results on searchmysite.net are grouped by domain, so if wikipedia were indexed, the "worst" that could happen (from the user's perspective) is that there's one extra group of results for every search query (and from my perspective the worst that could happen is that it doubles the running costs).
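
A minimal sketch of what domain-grouped results look like using Solr's result grouping (the host, core name and field names here are assumptions, not necessarily how searchmysite.net is configured):

```python
# Query Solr with one group of results per domain.
import requests

params = {
    "q": "text:openstreetmap",
    "group": "true",           # enable result grouping
    "group.field": "domain",   # one group per domain, so wikipedia would be one group
    "group.limit": 3,          # show at most 3 results per domain
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/content/select", params=params)
for group in resp.json()["grouped"]["domain"]["groups"]:
    print(group["groupValue"], group["doclist"]["numFound"])
```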

I do still think it could make it a search engine people would be prepared to use on a slightly more frequent basis, e.g. instead of looking something up on wikipedia, look it up on here to get the wikipedia link and see if anyone has written anything interesting about the topic.

And I still like the idea of ultimately turning it into a more general purpose search engine, with the crucial differentiator that it only searches the useful and interesting parts of the web, potentially still under the "personal and independent websites" categories (although given the amount of time I've been able to spend on it recently that's probably a fair way off).

If it sounds like I'm trying to talk myself into doing this, I have to admit it is partly because indexing wikipedia is also simply an itch I'd like to scratch :-) I could always remove it if it wasn't useful.

BTW, some people have talked about liking the idea of a search where you can suppress results from certain domains, so that could be an option, although it may require user profiles to be more useful, and that's something I'm trying to avoid.

m-i-l added a commit that referenced this issue Oct 9, 2021
m-i-l added a commit that referenced this issue Oct 9, 2021
m-i-l added a commit that referenced this issue Oct 9, 2021: …ory issues with a large number of documents, for #10 Index wikipedia
m-i-l added a commit that referenced this issue Oct 9, 2021: …ce max indexing memory consumption, given Solr memory increased for #10 Index wikipedia
@m-i-l
Contributor Author

m-i-l commented Oct 10, 2021

Written and deployed the bulk import scripts, added an indexing_type column to tblIndexedDomains to allow for different forms of indexing, set indexing_type to 'spider/default' for everything currently indexed, updated the indexing and management scripts accordingly, and moved wikipedia.org from tblExcludeDomains to tblIndexedDomains with an indexing_type of 'bulkimport/wikipedia'.
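
For reference, the database changes described above would look roughly like the following (a sketch only: the real tables have more columns, and the connection details and SQL types are assumptions):

```python
import psycopg2

conn = psycopg2.connect("dbname=searchmysitedb user=postgres")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # Allow different forms of indexing per domain
    cur.execute("ALTER TABLE tblIndexedDomains ADD COLUMN indexing_type TEXT;")
    # Everything indexed so far keeps the existing spider-based indexing
    cur.execute("UPDATE tblIndexedDomains SET indexing_type = 'spider/default';")
    # Move wikipedia.org over to the new bulk import indexing type
    cur.execute("DELETE FROM tblExcludeDomains WHERE domain = 'wikipedia.org';")
    cur.execute(
        "INSERT INTO tblIndexedDomains (domain, indexing_type) "
        "VALUES ('wikipedia.org', 'bulkimport/wikipedia');"
    )
```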

Also increased the Solr Java memory, decreased the number of concurrent sites that can be indexed (to reduce memory), and removed a CPU/memory intensive clause from the boost (relevancy tuning).

Now that wikipedia.org is in tblIndexedDomains, sites being indexed or reindexed will have wikipedia links included in their indexed_outlinks, so when wikipedia itself is indexed it can determine the correct indexed_inlinks (and indexed_inlink_domains etc.) for the PageRank-like relevancy tuning. It'll take 28 days for all sites to be reindexed naturally, i.e. without a forced reindex.
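
A sketch of how the indexed_inlinks and indexed_inlink_domains for a wikipedia page could then be derived from the indexed_outlinks stored on other sites' documents (the Solr URL and the url/domain field names are assumptions):

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/content/select"  # hypothetical URL

def indexed_inlinks_for(page_url):
    """Find pages whose indexed_outlinks contain page_url, and their domains."""
    params = {
        "q": 'indexed_outlinks:"{}"'.format(page_url),
        "fl": "url,domain",
        "rows": 1000,
        "wt": "json",
    }
    docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
    inlinks = [d["url"] for d in docs]
    inlink_domains = sorted({d["domain"] for d in docs})
    return inlinks, inlink_domains

# inlinks, domains = indexed_inlinks_for("https://en.wikipedia.org/wiki/IndieWeb")
# The number of distinct inlink domains then feeds the PageRank-like boost.
```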

@ScootRay

Sounds good. Glad you are still working on this, great software, Thanks!

@m-i-l
Contributor Author

m-i-l commented Oct 17, 2021

> Sounds good. Glad you are still working on this, great software, Thanks!

@ScootRay Many thanks for your support. It is always great to hear from users, especially when it is positive feedback.

@m-i-l
Contributor Author

m-i-l commented Oct 17, 2021

The Wikipedia indexing script itself is automated, checking which Wikipedia export was used for the last import and whether a new one is available. However, it requires around 150GB of storage while it is running, which is a lot to be paying for when unused, so I increase storage while it runs and decrease it afterwards, and that is not something I have automated. The script is therefore run manually rather than via a scheduled job.
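
For illustration, the "is there a newer export than the one last imported?" check might look something like this (the dumps.wikimedia.org directory listing format and the local state file are assumptions about how the real script works):

```python
import re
import urllib.request

DUMPS_INDEX = "https://dumps.wikimedia.org/enwiki/"
STATE_FILE = "last_imported_dump.txt"  # hypothetical, e.g. contains "20211001"

def latest_available_dump():
    """Return the newest YYYYMMDD dump directory listed on the dumps index page."""
    html = urllib.request.urlopen(DUMPS_INDEX).read().decode("utf-8")
    dates = re.findall(r'href="(\d{8})/"', html)
    return max(dates) if dates else None

def new_dump_available():
    try:
        with open(STATE_FILE) as f:
            last_imported = f.read().strip()
    except FileNotFoundError:
        last_imported = ""
    latest = latest_available_dump()
    return latest is not None and latest > last_imported

if new_dump_available():
    print("Newer Wikipedia export available - run the bulk import (after increasing storage).")
```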

I hope to write a blog post with more information on how it all works in the next 2-3 weeks. Until then, closing this as complete.

@m-i-l m-i-l closed this as completed Oct 17, 2021
@m-i-l
Contributor Author

m-i-l commented Oct 30, 2021

Blog entry with plenty of further details at https://blog.searchmysite.net/posts/searchmysite.net-now-with-added-wikipedia-goodness/ .
