Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it possible to stream the terms matching an Automaton #297

Merged
merged 14 commits into from
May 11, 2018

Conversation

drusellers
Copy link
Contributor

Closes #273

Ok, so I've started down the path, but I'm for sure lost.

  1. Not sure how I would test with Automaton - still reading up on the fst code there
  2. Not sure how to use the fst StreamBuilder, but it feels close on the code implementation, but I have compiler issues, which could just be about spreading the generic type love around more.

@@ -101,7 +102,7 @@ fn open_fst_index(source: ReadOnlySource) -> fst::Map {
/// The term dictionary contains all of the terms in
/// `tantivy index` in a sorted manner.
///
/// The `Fst` crate is used to assoicated terms to their
/// The `Fst` crate is used to assoicate terms to their
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hehe there is still the typo :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha! that's what I get. ;p

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you fix the typo:

assoicate => associate

use fst::map;

// given an Automaton and a fst::Map (fst_index)
// how can I generate a streambuilder?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the question here???

Copy link
Collaborator

@fulmicoton fulmicoton May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh gotcha:

    pub fn search<'a, A: Automaton>(&'a self, automaton: A) -> TermStreamerBuilder<'a, A> {
        let sb = self.fst_index.search(automaton);
        TermStreamerBuilder::<A>::new(self, sb)
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

@fulmicoton
Copy link
Collaborator

Can you push those to the tantivy repo as feature branches in the future? This will make it possible for me to commit stuff.

@drusellers
Copy link
Contributor Author

Can you push those to the tantivy repo as feature branches in the future? This will make it possible for me to commit stuff.

Absolutely. :) Do you have any preferences for branch names in the tantivy repo?

Also, for the life of me I can't get the test to compile. :/

error[E0277]: the trait bound `fst_levenshtein::Levenshtein: fst::Automaton` is not satisfied
   --> src/termdict/mod.rs:459:31
    |
459 |         let range = term_dict.search(automaton).into_stream();
    |                               ^^^^^^ the trait `fst::Automaton` is not implemented for `fst_levenshtein::Levenshtein`

But it appears to be implemented right here.

https://github.com/BurntSushi/fst/blob/master/fst-levenshtein/src/lib.rs#L142

@drusellers
Copy link
Contributor Author

Should I be using fst_levenshtein::Levenshtein to test this code or something else?

@fulmicoton
Copy link
Collaborator

Alternatively you can use https://github.com/tantivy-search/levenshtein-automata or write your own. You still have a compilation error?

@drusellers
Copy link
Contributor Author

Yeah, same error. Then I saw your impl of Levenshtein and wondered if that is a better "Test"

@drusellers
Copy link
Contributor Author

error[E0277]: the trait bound `levenshtein_automata::DFA: fst::Automaton` is not satisfied
   --> src/termdict/mod.rs:464:31
    |
464 |         let range = term_dict.search(dfa).into_stream();
    |                               ^^^^^^ the trait `fst::Automaton` is not implemented for `levenshtein_automata::DFA`

:(

@fulmicoton
Copy link
Collaborator

fulmicoton commented May 11, 2018

Hmmm.... The only think I can think of, is version of fst conflicting for some reason.
Let me try on my computer.

@fulmicoton
Copy link
Collaborator

Yeah it compiles on my computer. Maybe try rm -f Cargo.lock?

@drusellers
Copy link
Contributor Author

Ok, did a cargo clean, cargo update - no love. I could try updating fst to 0.3.0

@drusellers
Copy link
Contributor Author

@fulmicoton sweet. one step in the right direction. :)

@drusellers
Copy link
Contributor Author

cargo 1.26.0 (0e7c5a931 2018-04-06) FWIW

@drusellers
Copy link
Contributor Author

cargo build is 👍
cargo test is 👎

just to be clear

@fulmicoton
Copy link
Collaborator

Ok cargo test fails on my computer as well. Let me check if I can solve that riddle.

@fulmicoton
Copy link
Collaborator

It works if you change tantivy's fst version to 0.3.

If you want to use tantivy's levenshtein_automata crate, be careful.
The implementation of fst::Automaton is guarded by a feature called fst_automaton.

So

levenshtein_automata = {version="0.1", features=["fst_automaton"]}

is the right way to add the dependency in tantivy's Cargo.toml

@coveralls
Copy link

coveralls commented May 11, 2018

Coverage Status

Coverage increased (+0.02%) to 91.386% when pulling f9091a4 on drusellers:fuzzy into 82d8741 on tantivy-search:master.

@drusellers drusellers changed the title WIP: Fuzzy Search Fuzzy Search May 11, 2018
@@ -28,6 +29,7 @@ serde_derive = "1.0"
serde_json = "1.0"
num_cpus = "1.2"
itertools = "0.5.9"
levenshtein_automata = {version="0.1", features=["fst_automaton"]}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably make this optional?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that's ok.

Cargo.toml Outdated
@@ -17,7 +17,8 @@ byteorder = "1.0"
lazy_static = "0.2.1"
tinysegmenter = "0.1.0"
regex = "0.2"
fst = {version="0.2", default-features=false}
fst = {version="0.3", default-features=false}
fst-levenshtein = "0.2"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be optional - maybe behind a fuzzy feature gate until the whole feature is done?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove fst-levenshtein and keep only levenshtein-automata. The latter does the same thing, but faster, better and stronger.

MmapReadOnly::open(&file)
.map(Some)
.map_err(|e| From::from(IOError::with_path(full_path.to_owned(), e)))
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this change was required when I updated fst -> 0.3 (could be new rust stable too) - but would want someone else's eyes on it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the unsafe ? There is a comment about it in the fst crate.
It was modified in a non-backward compatible way in 0.3. As a general rule any change from 0.x to 0.y may change backward compatibility.
For version > 1.x, backward compatilibility is respected as long as the major version does not change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, since it was an unsafe change I just wanted someone else to look at it. :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(To be really "philosophically" correct here, we should probably make the caller unsafe... I think this is borderline enough to not care.)

@@ -42,7 +42,7 @@ impl ReadOnlySource {
pub fn as_slice(&self) -> &[u8] {
match *self {
#[cfg(feature = "mmap")]
ReadOnlySource::Mmap(ref mmap_read_only) => unsafe { mmap_read_only.as_slice() },
ReadOnlySource::Mmap(ref mmap_read_only) => mmap_read_only.as_slice(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this change was required when I updated fst -> 0.3 (could be new rust stable too) - but would want someone else's eyes on it.

/// a range of terms that should be streamed.
pub struct TermStreamerBuilder<'a> {
pub struct TermStreamerBuilder<'a, A>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make A=AlwaysMatch a default

@@ -58,15 +65,21 @@ impl<'a> TermStreamerBuilder<'a> {

/// `TermStreamer` acts as a cursor over a range of terms of a segment.
/// Terms are guaranteed to be sorted.
pub struct TermStreamer<'a> {
pub struct TermStreamer<'a, A>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make A=AlwaysMatch a default

use schema::Term;
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use termdict::TermOrdinal;
use termdict::TermStreamer;

pub struct HeapItem<'a> {
pub streamer: TermStreamer<'a>,
pub streamer: TermStreamer<'a, AlwaysMatch>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useless once you set the default make A=AlwaysMatch`

@@ -44,7 +45,7 @@ pub struct TermMerger<'a> {
impl<'a> TermMerger<'a> {
/// Stream of merged term dictionary
///
pub fn new(streams: Vec<TermStreamer<'a>>) -> TermMerger<'a> {
pub fn new(streams: Vec<TermStreamer<'a, AlwaysMatch>>) -> TermMerger<'a> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useless once you set the default make A=AlwaysMatch`

@@ -381,7 +383,7 @@ mod tests {
let source = ReadOnlySource::from(buffer);
let term_dictionary: TermDictionary = TermDictionary::from_source(source);

let value_list = |mut streamer: TermStreamer| {
let value_list = |mut streamer: TermStreamer<AlwaysMatch>| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useless once you set the default make A=AlwaysMatch`

@@ -366,6 +366,8 @@ mod tests {

#[test]
fn test_stream_range_boundaries() {
use fst::automaton::AlwaysMatch;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useless once you set the default make A=AlwaysMatch`

@@ -191,12 +193,19 @@ impl TermDictionary {

/// Returns a range builder, to stream all of the terms
/// within an interval.
pub fn range<'a>(&'a self) -> TermStreamerBuilder<'a> {
pub fn range<'a>(&'a self) -> TermStreamerBuilder<'a, AlwaysMatch> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useless once you set the default make A=AlwaysMatch`

TermStreamerBuilder::new(self, self.fst_index.range())
}

/// A stream of all the sorted terms. [See also `.stream_field()`](#method.stream_field)
pub fn stream<'a>(&'a self) -> TermStreamer<'a> {
pub fn stream<'a>(&'a self) -> TermStreamer<'a, AlwaysMatch> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useless once you set the default make A=AlwaysMatch`

@@ -215,7 +216,7 @@ pub struct RangeWeight {
}

impl RangeWeight {
fn term_range<'a>(&self, term_dict: &'a TermDictionary) -> TermStreamer<'a> {
fn term_range<'a>(&self, term_dict: &'a TermDictionary) -> TermStreamer<'a, AlwaysMatch> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useless once you set the default make A=AlwaysMatch`

Default Types FTW
@drusellers
Copy link
Contributor Author

@fulmicoton thank you for helping me with this. It was cool to see this work. I'm now excited to try and close out some more of the fuzzy related issues. 👍

/// Returns a search builder, to stream all of the terms
/// within the Automaton
pub fn search<'a, A: Automaton>(&'a self, automaton: A) -> TermStreamerBuilder<'a, A> {
let sb = self.fst_index.search(automaton);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid riddly short names like sb. especially where the type is non-trviail to the reader.
sb => stream_builder.

@fulmicoton
Copy link
Collaborator

kudos! We're almost there.

@fulmicoton fulmicoton changed the title Fuzzy Search Fuzzy search the Term Dictionary May 11, 2018
@fulmicoton
Copy link
Collaborator

@drusellers That was the hardest part, right there! I'll fill another ticket to get the next step toward a fuzzy search.

@fulmicoton fulmicoton changed the title Fuzzy search the Term Dictionary Make it possible to stream the terms matching an Automaton May 11, 2018
src/lib.rs Outdated
@@ -136,9 +136,11 @@ extern crate combine;
extern crate crossbeam;
extern crate fnv;
extern crate fst;
extern crate fst_levenshtein;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove fst_levenshtein now?

@fulmicoton
Copy link
Collaborator

We're almost good to go. Can you just remove the deps to fst-levenshtein?

@fulmicoton fulmicoton merged commit 08d2cc6 into quickwit-oss:master May 11, 2018
@drusellers drusellers deleted the fuzzy branch May 11, 2018 22:52
drusellers added a commit to drusellers/tantivy that referenced this pull request Aug 18, 2018
…oss#297)

* rustfmt and some English grammar

* sort cargo.toml crates

* WIP: something to show

* Remove example for now

* Implement desired method

* Resolving Generic Type Arguments

* Resolve Generic Types

* Banging around on the tests

* DANGER! Change unsafe usage based on compiler warnings

* Unscrew up my rebase

* Clean Up Type Spam

Default Types FTW

* typo

* better variable names

* Remove Duplicate Levenshtein crate
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a .search(automaton: fst::Automaton) to the TermDictionary
3 participants