Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/solrcloud #138

Closed
wants to merge 45 commits into from

Conversation

upayavira
Copy link
Contributor

An extension to pysolr to make it Zookeeper/SolrCloud aware. This is cloned from the code in the SolrJ client.
The tests are limited to proving that this does not break existing functionality, although I have tested (manually) that it does correctly failover between nodes when a node in a cluster fails.

Commit Checklist

  • Test coverage for ZooKeeper / SolrCloud error states
  • Add a Travis test matrix which runs without Kazoo installed to confirm that nothing breaks for traditional usage (the SolrCloud tests are supposed to be skipped)
  • Support SolrCloud 5 and have a Travis test matrix entry for both major versions
  • Add test that confirms that Pysolr fails-over correctly when one of the Solr nodes disappears (can be simulated with kill -STOP and kill -CONT)

@danizen
Copy link

danizen commented Mar 31, 2015

+1 - I need this in my environment.

mylanium added a commit to mylanium/pysolr that referenced this pull request Sep 3, 2015
@upayavira
Copy link
Contributor Author

Any suggestions as to what is holding this patch back from a merge? Thx!

(I notice there's conflicts which I will resolve)

@mmroden
Copy link

mmroden commented Jan 26, 2016

I would really like this feature to be merged in; what needs to be done to make that happen, apart from conflict resolution?

@acdha
Copy link
Collaborator

acdha commented Jan 26, 2016

Beyond conflict cleanup & a quick style review, the main thing I'd want to see would be updating the test runner so it continues to test regular Solr along with running an instance with SolrCloud enabled.

The install packaging also might want a little review – this was rolled into the Tomcat target but seems like it should be a separate one, particularly since the Solr project has itself deprecated deployment on Tomcat.

@acdha
Copy link
Collaborator

acdha commented Jan 26, 2016

That said, if you have time to work on this, I can review & ship a release.

@upayavira
Copy link
Contributor Author

I can look into that, @acdha

@acdha
Copy link
Collaborator

acdha commented Jan 26, 2016

@upayavira awesome, thanks!

@upayavira
Copy link
Contributor Author

@acdha I've made some reasonably substantial changes to this PR. Basically, instead of starting Solr and getting on with it, we now need the tests to start/stop the correct type of Solr, so I improved your start-solr-test-server.sh script to prepare and start both non-cloud and cloud instances, and then extended the tests to make them able to start the right Solr in their setupClass method.

By extending the SolrTestCase with a SolrCloudTestCase, I was able to have most of your existing tests run a second time against a solrcloud instance.

The tests seem to consistently pass, but for some reason seem to hang at the end, for reasons I have yet to identify.

I am considering rewriting start-solr-test-server.sh in Python, but more importantly I want/need a test that hits Solr repeatedly whilst killing one of the two nodes that contains the collection being queried. Note, it starts three Solr nodes, one for Zookeeper and two to host collections. This means we can kill either of the collection nodes without disrupting Zookeeper.

Please let me know what you make of this (rather substantial) PR. Comments/suggestions welcome, before I start writing the above, more complex, test.

@mmroden
Copy link

mmroden commented Feb 10, 2016

For what it's worth, the tests work on my machine (macos 10.10.5) once I change to the feature/solrcloud branch and I install kazoo. I have these messages, but I think they're in kazoo, not pysolr:

Waiting for simple-solr ----No handlers could be found for logger "kazoo.client"
No handlers could be found for logger "kazoo.client"

Upayavira and others added 14 commits February 10, 2016 13:55
This follows the capitalization used on http://zookeeper.apache.org/
* Don't force string interpolation for performance and compatibility
  with logging tools like Raven/Sentry
* Pass full exceptions to logging
* Remove unused variables
* Remove backwards-compatibility imports
* Rename tests so unittest2 discovery works out of the box
* PEP-8 import sorting & whitespace
This avoids leaking file handles until the end of the test run
@acdha
Copy link
Collaborator

acdha commented Feb 10, 2016

@upayavira I've made some changes in upayavira#1. I definitely agree that we're probably close to the point where we should just port the server launcher to Python since we have a bunch of daemons to manage now and will want things like timeouts for hard-kills on the various processes.

* Remove unused BACKGROUND_SOLR option
* prepare stops possible stale instances on startup
* More detailed progress for Solr process management
* Consistent ZooKeeper name
* run-tests now includes output for major events
* More consistent indentation
The call signature changes from Kazoo 2.0 to 2.2 and since
we don’t care either way we can future-proof it.
@upayavira
Copy link
Contributor Author

I presume by 100% test coverage you mean that all tests pass? That should now be the case. I also raise a few more explicit SolrErrors in there too. Let me know what you think.

@acdha
Copy link
Collaborator

acdha commented Feb 16, 2016

@upayavira I was thinking that we have 100% coverage of the new code by the test suite. It looked like most of the remaining gaps were in error-handling, which is definitely easy to miss in future changes.

acdha added a commit to acdha/pysolr that referenced this pull request Feb 16, 2016
This optionally adds support for SolrCloud using the Kazoo client
library.

Thanks to @upayavira
@acdha
Copy link
Collaborator

acdha commented Feb 16, 2016

master now has a few minor updates, the most notable being acdha@d23de58 to ensure that the byte-strings containing JSON are decoded first before passing them to json.loads

@acdha acdha added this to the v4.0.0 milestone Feb 16, 2016
@acdha acdha added the feature label Feb 16, 2016
@acdha acdha self-assigned this Feb 16, 2016
@acdha
Copy link
Collaborator

acdha commented Feb 16, 2016

acdha@cbc07af adds an env: option to Travis CI so the tests are run with and without Kazoo installed, with the ~32 SolrCloud tests skipped in the latter case.

@upayavira
Copy link
Contributor Author

I have some initial/prototype code that should work with both 4.x and 5.x for SolrCloud, and also implements some of the Collections API for creating/deleting collections/etc. I'll post it as a fresh PR once I've played with it a bit more. Really though, as with the above code, it will need some decent tests around it to prove it behaves properly in failover scenarios.

@mmroden
Copy link

mmroden commented Mar 7, 2016

Hi all,

Where do we stand on this? Is there a sense that pysolr 4.0 will be coming out soon-ish (ie, next two weeks), or should I go ahead and use this branch as a one-off in a private pypi?

Thanks!

@vvolkman
Copy link

vvolkman commented Mar 8, 2016

+1 I really need the custom search_handler in a released version. Tired of pulling from the tip instead of the pip. thanks in advance!

@acdha
Copy link
Collaborator

acdha commented Mar 8, 2016

I don't think we're ready for 4.0 yet unless someone has time to help add tests – in particular, it'd be good to start testing with Solr 5 as well as 4.x.

@upayavira
Copy link
Contributor Author

I have a test for failover which seems to work (it passes). I need to add one in which both nodes are down and confirm that it fails.

This code WILL NOT work with Solr 5.x. I've got as far as working out exactly what we need to do (and have some hacked, unpublishable code), but just need to find the time to make it happen.

@danizen
Copy link

danizen commented Mar 8, 2016

Upayavira, can you expand on where it won't work? I have at work pysolr
working with Solr 5 in a non-cloud mode, and I'm anticipating a switch to
cloud.

I have a test for failover which seems to work (it passes). I need to add
one in which both nodes are down and confirm that it fails.

This code WILL NOT work with Solr 5.x. I've got as far as working out
exactly what we need to do (and have some hacked, unpublishable code), but
just need to find the time to make it happen.


Reply to this email directly or view it on GitHub
#138 (comment).

@upayavira
Copy link
Contributor Author

When Solr 4.0 released, SolrCloud stored its cluster information in clusterstate.json - every collection in a single file in Zookeeper. This worked, but people started building clusters with 1000s of collections, leading to an unmanageable file. Therefore, somewhere around the 5.0 mark it switched to having a per-collection state.json file. This allows a zookeeper client to follow just the collections it is interested in.

I don't see any reason why the client cannot be coded to handle both setups. We would need a way to test against both, though, which might require two different Solr installs to run our tests against.

The consequence is that you will get "collection not found" errors, even though you know the collection exists (pysolr looks for clusterstate.json but its state is actually in /collections/$COLLECTION/state.json).

@upayavira
Copy link
Contributor Author

See #187 for further work here. Should we close this PR now?

@acdha
Copy link
Collaborator

acdha commented Mar 10, 2016

@upayavira We have 3 unchecked tests in the todo list – looking like they might better as separate issues since some of them have a fair amount of work?

@upayavira
Copy link
Contributor Author

Test coverage for ZooKeeper / SolrCloud error states

If you can clarify what you are looking at here, I can look into that.

Support SolrCloud 5 and have a Travis test matrix entry for both major versions
Solr 5.x support is in PR #187 .

I plan to pull the start-solr-test-server.sh for 4.10, rename it start-solr-test-server-4.10.1.sh, and add a way that run-tests.py can use one or the other. That'll mean we can have a Travis run for each version, both using the latest (5.x compatible) code.

Add test that confirms that Pysolr fails-over correctly when one of the Solr nodes disappears (can be simulated with kill -STOP and kill -CONT)

This test is, again, in PR #187 . I accept that I should have separated PR 187 into three PRs, but they kinda all happened at the same time. If you want them separating, let me know and I'll see how possible that is.

@acdha
Copy link
Collaborator

acdha commented Mar 11, 2016

@upayavira The main thing I was thinking is confirming that we do something reasonable when ZooKeeper itself has either failed outright or is unresponsive. I think that'd just be a question of making sure our timeouts are used and that something useful is logged.

Since you have a test in #187 I checked the failover tests off the list here.

@upayavira
Copy link
Contributor Author

I've been working all morning on the failover tests. Getting them right isn't easy. It seems that a kill -17 does not cause a change the live_nodes entry in ZooKeeper, so we cannot use a kill -17.

Test scenarios:

  • Does it handle when one replica is down for a shard of a collection?
  • Does it fail gracefully when there are no nodes available for a shard/collection?
  • Does it fail gracefully if it cannot connect to ZK, and recover once ZK comes back?

These are non-trivial tests. The code under test is WAY simpler. I'm mulling now on how to implement them, but I'm open to suggestions,

@acdha
Copy link
Collaborator

acdha commented Mar 11, 2016

One thing I've been wondering is whether it makes sense to merge the work-in-progress here and postpone some of the more involved tests or simply defer them entirely to Kazoo or the upstream ZK/Solr projects and instead just use something like mock.patch to confirm that our logging code, etc. works when the important KazooClient methods raise exceptions.

@upayavira
Copy link
Contributor Author

See code in #187 , just committed. I believe it does the job for the above three test scenarios. It turns out it was important to get the tests right, as they did show up one particular bug (not readding a watch when a watch is triggered).

All that is left is support for both 4.x and 5.x. And, by comparison, that is seriously trivial!

@upayavira
Copy link
Contributor Author

I've now got code that works against both 4.x and 5.x. You can set an environment variable to tell it which startup script to use (which version of start-solr-test-server.sh), i.e. the 4.x or the 5.x one. Thus, it should be possible to keep a version of each running in Travis.

This also involved redoing the retry logic somewhat. Solr 4.10 wasn't updating its cluster state as fast as it should, so I made it, by default, retry every 0.02s, and retry 20 times. I saw maybe 4 or 5 retries needed in my tests. These can be overridden on the SolrCloud constructor. I also needed to call random.seed() as random.choice was giving me the same value every time.

I will commit this work against #187 as soon as I get a next moment.

@acdha
Copy link
Collaborator

acdha commented Mar 14, 2016

@upayavira that's excellent news!

@upayavira
Copy link
Contributor Author

Moments have been far apart, but #187 now has my latest changes. You can run the tests against 5.5 with:
./run-tests.py
or against 4.10.1 with:
PYSOLR_STARTER=./start-solr-test-server-4x.sh ./run-tests.py
Add both of these to Travis, and you should have good test coverage.

@stale
Copy link

stale bot commented Jun 5, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 5, 2018
@stale stale bot closed this Jul 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants