Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it possible to write the index from independent thread easily. #550

Closed
fulmicoton opened this issue May 22, 2019 · 25 comments
Closed

Comments

@fulmicoton
Copy link
Collaborator

(This is a follow up from #549)

Ideally what I need is the ability to write to index from independent threads each having own writer.

This problem is very common and hurts some Very Important Project relying on tantivy (toshi, plume). (Invoking the name @fdb-hiroshima and @hntd187 for the discussion).

Web server are typically multithreaded and requests may spawn the need to add or delete documents. Dealing with a Arc<RwLock<IndexWriter>> might feel dirty, and rust beginners may not really understand the logic behind that.

On the other hand, the IndexWriter already relies on a channel to dispatch indexing operation to its own small thread pool. Stamping is also done using Atomics. There is actually no real reason to prevent .add_document() and and .delete_term() to happen from different threads.

The problem is

  • what should happen with .commit()
  • what should happen with .rollback()

Especially would this ability to index from several threads confuse people on

  • what a commit is in tantivy? (.commit() ensure that all operations that happened but the .commit() are processed)
  • how multithreaded indexing works ? (indexing from several thread will no help. Tantivy has its own multithreading system).

Also,

  • should .commit() and .rollback() block other operations? (It is technically possible to have .commit() only block other .commit() operations.)
  • Should .commit() and .rollback() return futures?
@fulmicoton
Copy link
Collaborator Author

fulmicoton commented May 22, 2019

A no brainer for next version of tantivy would be to remove the make &mut self -> &self change on add_document.

Toshi could then exchange its Mutex for a RWLock and remove some contention.

@hntd187
Copy link
Contributor

hntd187 commented May 22, 2019

That would wrap up a lot of contention especially if you defer the commit for too long. Another very small change I think makes sense is to expose IndexMeta outside tantivy, currently it's only pub in it's module, but not exported outside the crate, that would make generating index summaries for indexes located on many machines much easier.

I also have some thoughts on the IndexWriter above, but it's late, I'll write them tomorrow after a bit of research.

@kkonevets
Copy link

kkonevets commented May 22, 2019

That would be a super cool feature. My end users are admiring of tantivy's read speed and if write from independent threads would be possible it would be fantastic. You make a great work, thanks!

By the way, if I use Arc<RwLock<IndexWriter>> wouldn't it fall with error if I clone such object for each thread? Actually I use Actix-Web framework to spawn threads.

@hntd187
Copy link
Contributor

hntd187 commented May 22, 2019

@kkonevets no I do it that way, it works fine, you just do Arc::clone(&writer) you're cloning the Arc, not the writer itself. I should mention though that obviously while each thread has a copy of the Arc, they will block when you take a mut ref to do actual writing.

@kkonevets
Copy link

kkonevets commented May 22, 2019

@hntd187 Actix-web does not allow to explicitly clone the object with Arc::clone(&writer)
Instead it postulates a notion of State which is cloned when spawning workers. So when I set Arc<RwLock<IndexWriter>> object in state it falls with error
Failed to acquire Lockfile: LockBusy. Some("Failed to acquire index lock. If you are usinga regular directory, this means there is already an IndexWriterworking on thisDirectory, in this process or in a different process.")

It's documentation says The server creates a separate application instance for each created worker. Application state is not shared between threads. To share state, Arc could be used. So I guess we do the right thing

@kkonevets
Copy link

I guess it would be nice to implement the feature in tantivy without the need to depend on the architecture of a specific web framework.

@fulmicoton
Copy link
Collaborator Author

@kkonevets Can you share the code you are talking about? The error you pointed is not called by a call to IndexWriter...

@kkonevets
Copy link

kkonevets commented May 22, 2019

@kkonevets
Copy link

kkonevets commented May 22, 2019

@fulmicoton I a sorry, have just committed the changes, pull if already cloned
look at ModifyState struct insrc/state

Just run cargo run --bin server to see the error

@fulmicoton
Copy link
Collaborator Author

Your problem is that multiple State object are created. Not that the Arc gets cloned.

@fulmicoton
Copy link
Collaborator Author

Can you try the following...

  let modify_state = ModifyState::new().unwrap();
   HttpServer::new(move || {
        vec![
            App::with_state(SearchState::new().unwrap())
                .prefix("/search")
                .resource("", |r| r.method(http::Method::POST).with(search_index))
                .boxed(),
            App::with_state(modify_state.clone())
                .prefix("/modify")
                .resource("", |r| r.method(http::Method::POST).with(modify_index))
                .boxed(),
        ]
    })
    .bind(host)
    .unwrap()
    .start();

@kkonevets
Copy link

error[E0277]: (dyn std::ops::Drop + std::marker::Send + 'static) cannot be shared between threads safely
--> src/bin/server.rs:138:5
|
138 | HttpServer::new(|| {
| ^^^^^^^^^^^^^^^ (dyn std::ops::Drop + std::marker::Send + 'static) cannot be shared between threads safely
|
= help: the trait std::marker::Sync is not implemented for (dyn std::ops::Drop + std::marker::Send + 'static)
= note: required because of the requirements on the impl of std::marker::Sync for std::ptr::Unique<(dyn std::ops::Drop + std::marker::Send + 'static)>
= note: required because it appears within the type std::boxed::Box<(dyn std::ops::Drop + std::marker::Send + 'static)>
= note: required because it appears within the type tantivy::directory::DirectoryLock
= note: required because it appears within the type std::option::Option<tantivy::directory::DirectoryLock>
= note: required because it appears within the type tantivy::IndexWriter
= note: required because of the requirements on the impl of std::marker::Sync for std::sync::RwLock<tantivy::IndexWriter>
= note: required because of the requirements on the impl of std::marker::Send for std::sync::Arc<std::sync::RwLock<tantivy::IndexWriter>>
= note: required because it appears within the type tsearch::state::ModifyState
= note: required because it appears within the type [closure@src/bin/server.rs:138:21: 149:6 sstate:tsearch::state::SearchState, modify_state:tsearch::state::ModifyState]
= note: required by <actix_web::server::HttpServer<H, F>>::new

@kkonevets
Copy link

kkonevets commented May 22, 2019

@fulmicoton thank you for your comment, I did create multiple State objects, my fault
But now it seems that DirectoryLock (that is in IndexWriter) does not implement a Sync trait and can not be shared between threads safely. How come IndexWriter can be shared? (also @hntd187 )

DirectoryLock is Send but not Sync. This means it is like a unique_ptr. You can transfer it between threads, but can't clone to pass to a new thread. Or am I completly wrong?

@petr-tik
Copy link
Contributor

This ticket looks interesting and I wanted to add my 2c.

* should `.commit()` and `.rollback()` block other operations? (It is technically possible to have `.commit()` only block other `.commit()` operations.)

I think blocking is fine. I estimate that most multi-threaded uses of tantivy are read-heavy >>> write-heavy. Blocking every once in a while should be ok. Unless, I am missing some obvious write-heavy use case, when reading needs to be non-blocking.

* Should `.commit()` and `.rollback()` return futures?

AFAIK, Futures imply an event-loop, which increases binary size and adds another thread model. What killer functionality do Futures add for our typical current and future use cases?

@fulmicoton
Copy link
Collaborator Author

@kkonevets Yeah. I need to mark the DirtectoryLock as Sync. In the meanwhile you can solve your problem by using a Mutex instead of a DirectoryLock.

@fulmicoton
Copy link
Collaborator Author

@petr-tik Your 2c are always welcome :).

I estimate that most multi-threaded uses of tantivy are read-heavy >>> write-heavy.

That's more or less true. There are some analytics usage where write are more frequent than reads...
But still we are talking about commits here, and we can consider read >>> commit operation.

I am suggesting futures here, not for performance reason at all, but for ergonomy.
I'm thinking of people trying to build a webserver in which a request handle commits without blocking for instance. Right now, it will require running commit from a thread pool.

@petr-tik
Copy link
Contributor

petr-tik commented May 22, 2019

I am going to make it 4c total, by adding another 2c.

I am suggesting futures here, not for performance reason at all, but for ergonomy.
I'm thinking of people trying to build a webserver in which a request handle commits without blocking for instance. Right now, it will require running commit from a thread pool.

Can we hide it behind a feature-flag or do you want to redesign the current commit pipeline to be async/futures-based?

I would use rust handy support for conditional compilation to make it a feature flag.

Thinking about it. If we ever commit to implementing a wasm backend #541, we can re-use the Future feature on webserver and wasm front-end that will convert function returns to JS Promises (I think, my JS understanding is very hand-wavy)

@hntd187
Copy link
Contributor

hntd187 commented May 23, 2019

If you're gonna add that future functionality it might be better to build it on where nightly is going or wait until the async/await lands in stable. Don't implement a contract that's going away soon.

@fulmicoton
Copy link
Collaborator Author

fulmicoton commented May 24, 2019

@petr-tik @hntd187 both of your points are super valid. Great input guys!

Assuming it does not imply any regression in perf, do I internalize the Arc<RwLock> in the IndexWriter, and return a Arc cloned without an error when the user call index.index_writer() ?

@fulmicoton
Copy link
Collaborator Author

Internalizing the RwLock is a bit tricky because of PreparedCommit...

As a general rule, I try as much as possible to rule out the chances to poison locks.
That means I ensure that during the lifetime of such a lock I never do io or call user defined code, or anything of that kind.

The straightforward implementation of Internalizing the RwLock could for instance require to have the WriteLockGuard in the PreparedCommit object but we would then allow for the user to write faulty code and poison this lock, or create a deadlock.

@kkonevets
Copy link

@fulmicoton I am sorry for a dilettante question. What do you mean by user code? How is it possible for the user to call something from PreparedCommit? Is PreparedCommit::set_payload() somehow related to user code?

@fulmicoton
Copy link
Collaborator Author

fulmicoton commented May 28, 2019

@kkonevets Sorry for the sparse information.

If the thread holding a RwlockGuard panicks. The lock is "poisoned". That means that trying to acquire the lock will return an error in the future.

I want to avoid the situation were a client of tantivy is unaware that a PreparedCommit wraps a RwLock guard under the hood. The risk is that if the client does something like that.

let prepared_commit = index_writer.prepare_commit()?;
some_operation_that_panics();

He will get an error next time he tries to add docs to the index.
The relationship between this error and the possibly "legitimate" panic quite opaque.

If we don't internalize the RwLock, user that want to use IndexWriter in a multithreaded environment will have to write a little bit more, but they will explicitly have to deal with the lock guards.

@fulmicoton
Copy link
Collaborator Author

closing the issue.

@niyue
Copy link

niyue commented Aug 28, 2019

@fulmicoton I am new to both Rust and Tantivy. To get a shared index writer in multiple threaded env, I am fighting with Rust's mut ref for a whole day. I wonder what the right approach to write the index from independent thread currently using v0.10.1?

A no brainer for next version of tantivy would be to remove the make &mut self -> &self change on add_document.

I see add_document doesn't require mut in latest version, since there is no summary for this this issue, does it mean the best approach of doing this is using Arc<RwLock<IndexWriter>>? Or is there any other easier approach? Thanks.

@fulmicoton
Copy link
Collaborator Author

Yeah you are still stuck with Arc<RwLock>. Just take the read lock for adding documents, and the write lock for committing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants