Make it possible to stream the terms matching an Automaton #297
src/termdict/termdict.rs
Outdated
@@ -101,7 +102,7 @@ fn open_fst_index(source: ReadOnlySource) -> fst::Map { | |||
/// The term dictionary contains all of the terms in | |||
/// `tantivy index` in a sorted manner. | |||
/// | |||
/// The `Fst` crate is used to assoicated terms to their | |||
/// The `Fst` crate is used to assoicate terms to their |
hehe there is still the typo :)
ha! that's what I get. ;p
Can you fix the typo: `assoicate` => `associate`
src/termdict/termdict.rs
Outdated
use fst::map; | ||
|
||
// given an Automaton and a fst::Map (fst_index) | ||
// how can I generate a streambuilder? |
I don't understand the question here???
oh gotcha:
pub fn search<'a, A: Automaton>(&'a self, automaton: A) -> TermStreamerBuilder<'a, A> {
    let sb = self.fst_index.search(automaton);
    TermStreamerBuilder::<A>::new(self, sb)
}
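For readers new to the idea, here is a rough, std-only sketch of what "streaming the terms matching an automaton" means. Everything in it is a hypothetical stand-in invented for illustration: `KeyMatcher`, `MatchAll`, `Prefix` and `ToyTermDict` are not part of `fst` or tantivy. The real `fst::Automaton` is state-based and lets the FST skip whole subtrees, which this toy version (a plain filter over a sorted map) does not attempt.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for an automaton: it only says
/// whether a complete key is accepted.
trait KeyMatcher {
    fn matches(&self, key: &[u8]) -> bool;
}

/// Accepts every key (similar in spirit to an "always match" automaton).
struct MatchAll;

impl KeyMatcher for MatchAll {
    fn matches(&self, _key: &[u8]) -> bool {
        true
    }
}

/// Accepts keys that start with the given prefix.
struct Prefix<'p>(&'p [u8]);

impl<'p> KeyMatcher for Prefix<'p> {
    fn matches(&self, key: &[u8]) -> bool {
        key.starts_with(self.0)
    }
}

/// Toy term dictionary: terms kept sorted, each mapped to a number.
struct ToyTermDict {
    terms: BTreeMap<Vec<u8>, u64>,
}

impl ToyTermDict {
    /// The number stored for each term is its position in `terms`.
    fn new(terms: &[&str]) -> ToyTermDict {
        let terms = terms
            .iter()
            .enumerate()
            .map(|(i, t)| (t.as_bytes().to_vec(), i as u64))
            .collect();
        ToyTermDict { terms }
    }

    /// Lazily yields, in sorted order, the terms accepted by `matcher`.
    fn search<'a, M: KeyMatcher + 'a>(
        &'a self,
        matcher: M,
    ) -> impl Iterator<Item = (&'a [u8], u64)> + 'a {
        self.terms
            .iter()
            .filter(move |(key, _)| matcher.matches(key.as_slice()))
            .map(|(key, value)| (key.as_slice(), *value))
    }
}

fn main() {
    let dict = ToyTermDict::new(&["abc", "abd", "b"]);
    for (term, value) in dict.search(Prefix(b"ab")) {
        println!("{} -> {}", String::from_utf8_lossy(term), value);
    }
    println!("total terms: {}", dict.search(MatchAll).count());
}
```

The shape mirrors the method above: the caller hands over some matcher type and gets back a lazy, sorted stream that is generic over that matcher.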
Thank you.
Can you push those to the tantivy repo as feature branches in the future? This will make it possible for me to commit stuff. |
Absolutely. :) Do you have any preferences for branch names in the tantivy repo? Also, for the life of me I can't get the test to compile. :/
But it appears to be implemented right here. https://github.com/BurntSushi/fst/blob/master/fst-levenshtein/src/lib.rs#L142 |
Should I be using |
Alternatively you can use |
Yeah, same error. Then I saw your impl of Levenshtein and wondered if that is a better "Test" |
:( |
Hmmm.... The only thing I can think of is versions of fst conflicting for some reason.
Yeah it compiles on my computer. Maybe try |
Ok, did a cargo clean, cargo update - no love. I could try updating |
@fulmicoton sweet. one step in the right direction. :) |
|
just to be clear |
Ok |
It works if you change tantivy's If you want to use So
is the right way to add the dependency in tantivy's |
@@ -28,6 +29,7 @@ serde_derive = "1.0" | |||
serde_json = "1.0" | |||
num_cpus = "1.2" | |||
itertools = "0.5.9" | |||
levenshtein_automata = {version="0.1", features=["fst_automaton"]} |
I should probably make this optional?
No that's ok.
Cargo.toml
Outdated
@@ -17,7 +17,8 @@ byteorder = "1.0" | |||
lazy_static = "0.2.1" | |||
tinysegmenter = "0.1.0" | |||
regex = "0.2" | |||
fst = {version="0.2", default-features=false} | |||
fst = {version="0.3", default-features=false} | |||
fst-levenshtein = "0.2" |
Should this be optional - maybe behind a `fuzzy` feature gate until the whole feature is done?
We can remove `fst-levenshtein` and keep only `levenshtein-automata`. The latter does the same thing, but faster, better and stronger.
MmapReadOnly::open(&file) | ||
.map(Some) | ||
.map_err(|e| From::from(IOError::with_path(full_path.to_owned(), e))) | ||
} |
I don't understand why this change was required when I updated fst -> 0.3 (could be new rust stable too) - but would want someone else's eyes on it.
You mean the `unsafe`? There is a comment about it in the fst crate.
It was modified in a non-backward compatible way in `0.3`. As a general rule, any change from `0.x` to `0.y` may break backward compatibility.
For versions >= 1.0, backward compatibility is respected as long as the major version does not change.
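The rule above matches how Cargo treats a default (caret) version requirement. Below is a small sketch of that rule, assuming full three-part versions and ignoring pre-release tags; `Version` and `caret_compatible` are names made up for this example and are not Cargo APIs.

```rust
/// (major, minor, patch). Hypothetical helper type for this sketch.
type Version = (u64, u64, u64);

/// Returns true if `candidate` would satisfy the caret requirement `^req`,
/// following the semver convention that the left-most non-zero component
/// must not change.
fn caret_compatible(req: Version, candidate: Version) -> bool {
    // Never accept something older than what was asked for.
    if candidate < req {
        return false;
    }
    match req {
        // 0.0.z: every release may break, so only the exact version matches.
        (0, 0, _) => candidate == req,
        // 0.y.z: a bump of the minor version may break compatibility.
        (0, minor, _) => candidate.0 == 0 && candidate.1 == minor,
        // x.y.z with x >= 1: compatible while the major version is unchanged.
        (major, _, _) => candidate.0 == major,
    }
}

fn main() {
    // Going from 0.2 to 0.3 counts as a breaking change.
    println!("{}", caret_compatible((0, 2, 0), (0, 3, 0)));
    // Staying within 1.x is expected to be compatible.
    println!("{}", caret_compatible((1, 0, 0), (1, 4, 2)));
}
```

Under this rule a dependency declared as `0.2` would not pick up `0.3` automatically, which is why the bump had to be made by hand and could surface API changes.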
Cool, since it was an unsafe change I just wanted someone else to look at it. :)
(To be really "philosophically" correct here, we should probably make the caller `unsafe`... I think this is borderline enough to not care.)
@@ -42,7 +42,7 @@ impl ReadOnlySource { | |||
pub fn as_slice(&self) -> &[u8] { | |||
match *self { | |||
#[cfg(feature = "mmap")] | |||
ReadOnlySource::Mmap(ref mmap_read_only) => unsafe { mmap_read_only.as_slice() }, | |||
ReadOnlySource::Mmap(ref mmap_read_only) => mmap_read_only.as_slice(), |
I don't understand why this change was required when I updated fst -> 0.3 (could be new rust stable too) - but would want someone else's eyes on it.
src/termdict/streamer.rs
Outdated
/// a range of terms that should be streamed. | ||
pub struct TermStreamerBuilder<'a> { | ||
pub struct TermStreamerBuilder<'a, A> |
Make `A=AlwaysMatch` a default.
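A minimal sketch of the suggestion, using Rust's default generic type parameters. The types here are simplified stand-ins, not the real tantivy or `fst` definitions: `AlwaysMatch` and `Levenshtein` are empty marker structs, and the `remaining` method is invented for the example.

```rust
use std::marker::PhantomData;

/// Stand-in marker types (hypothetical; the real ones are automata).
struct AlwaysMatch;
struct Levenshtein;

/// `A = AlwaysMatch` makes the second parameter optional wherever the
/// type is written out, so pre-existing signatures keep compiling.
struct TermStreamer<'a, A = AlwaysMatch> {
    terms: &'a [&'a str],
    _automaton: PhantomData<A>,
}

impl<'a, A> TermStreamer<'a, A> {
    fn new(terms: &'a [&'a str]) -> Self {
        TermStreamer {
            terms,
            _automaton: PhantomData,
        }
    }

    /// Number of terms left in this toy streamer.
    fn remaining(&self) -> usize {
        self.terms.len()
    }
}

/// No need to spell `TermStreamer<'a, AlwaysMatch>` here:
/// the default fills in the missing parameter.
struct HeapItem<'a> {
    streamer: TermStreamer<'a>,
}

fn main() {
    let terms = ["apple", "banana"];
    // `A` is taken from the field type, i.e. the default `AlwaysMatch`.
    let item = HeapItem {
        streamer: TermStreamer::new(&terms),
    };
    // Callers that want another automaton still name it explicitly.
    let fuzzy: TermStreamer<Levenshtein> = TermStreamer::new(&terms);
    println!("{} {}", item.streamer.remaining(), fuzzy.remaining());
}
```

With the default in place, the explicit `AlwaysMatch` annotations added across the merger, range query and tests become unnecessary.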
src/termdict/streamer.rs
Outdated
@@ -58,15 +65,21 @@ impl<'a> TermStreamerBuilder<'a> { | |||
|
|||
/// `TermStreamer` acts as a cursor over a range of terms of a segment. | |||
/// Terms are guaranteed to be sorted. | |||
pub struct TermStreamer<'a> { | |||
pub struct TermStreamer<'a, A> |
Make `A=AlwaysMatch` a default.
src/termdict/merger.rs
Outdated
use schema::Term; | ||
use std::cmp::Ordering; | ||
use std::collections::BinaryHeap; | ||
use termdict::TermOrdinal; | ||
use termdict::TermStreamer; | ||
|
||
pub struct HeapItem<'a> { | ||
pub streamer: TermStreamer<'a>, | ||
pub streamer: TermStreamer<'a, AlwaysMatch>, |
Useless once you make `A=AlwaysMatch` the default.
src/termdict/merger.rs
Outdated
@@ -44,7 +45,7 @@ pub struct TermMerger<'a> { | |||
impl<'a> TermMerger<'a> { | |||
/// Stream of merged term dictionary | |||
/// | |||
pub fn new(streams: Vec<TermStreamer<'a>>) -> TermMerger<'a> { | |||
pub fn new(streams: Vec<TermStreamer<'a, AlwaysMatch>>) -> TermMerger<'a> { |
Useless once you make `A=AlwaysMatch` the default.
src/termdict/mod.rs
Outdated
@@ -381,7 +383,7 @@ mod tests { | |||
let source = ReadOnlySource::from(buffer); | |||
let term_dictionary: TermDictionary = TermDictionary::from_source(source); | |||
|
|||
let value_list = |mut streamer: TermStreamer| { | |||
let value_list = |mut streamer: TermStreamer<AlwaysMatch>| { |
Useless once you make `A=AlwaysMatch` the default.
src/termdict/mod.rs
Outdated
@@ -366,6 +366,8 @@ mod tests { | |||
|
|||
#[test] | |||
fn test_stream_range_boundaries() { | |||
use fst::automaton::AlwaysMatch; |
Useless once you make `A=AlwaysMatch` the default.
src/termdict/termdict.rs
Outdated
@@ -191,12 +193,19 @@ impl TermDictionary { | |||
|
|||
/// Returns a range builder, to stream all of the terms | |||
/// within an interval. | |||
pub fn range<'a>(&'a self) -> TermStreamerBuilder<'a> { | |||
pub fn range<'a>(&'a self) -> TermStreamerBuilder<'a, AlwaysMatch> { |
Useless once you make `A=AlwaysMatch` the default.
src/termdict/termdict.rs
Outdated
TermStreamerBuilder::new(self, self.fst_index.range()) | ||
} | ||
|
||
/// A stream of all the sorted terms. [See also `.stream_field()`](#method.stream_field) | ||
pub fn stream<'a>(&'a self) -> TermStreamer<'a> { | ||
pub fn stream<'a>(&'a self) -> TermStreamer<'a, AlwaysMatch> { |
Useless once you make `A=AlwaysMatch` the default.
src/query/range_query.rs
Outdated
@@ -215,7 +216,7 @@ pub struct RangeWeight { | |||
} | |||
|
|||
impl RangeWeight { | |||
fn term_range<'a>(&self, term_dict: &'a TermDictionary) -> TermStreamer<'a> { | |||
fn term_range<'a>(&self, term_dict: &'a TermDictionary) -> TermStreamer<'a, AlwaysMatch> { |
Useless once you make `A=AlwaysMatch` the default.
Default Types FTW
@fulmicoton thank you for helping me with this. It was cool to see this work. I'm now excited to try and close out some more of the fuzzy related issues. 👍 |
src/termdict/termdict.rs
Outdated
/// Returns a search builder, to stream all of the terms | ||
/// within the Automaton | ||
pub fn search<'a, A: Automaton>(&'a self, automaton: A) -> TermStreamerBuilder<'a, A> { | ||
let sb = self.fst_index.search(automaton); |
Let's avoid cryptic short names like `sb`, especially where the type is non-trivial to the reader.
`sb` => `stream_builder`.
kudos! We're almost there. |
@drusellers That was the hardest part, right there! I'll file another ticket to get the next step toward a fuzzy search.
src/lib.rs
Outdated
@@ -136,9 +136,11 @@ extern crate combine; | |||
extern crate crossbeam; | |||
extern crate fnv; | |||
extern crate fst; | |||
extern crate fst_levenshtein; |
We can remove `fst_levenshtein` now?
We're almost good to go. Can you just remove the deps to |
…oss#297)
* rustfmt and some English grammar
* sort cargo.toml crates
* WIP: something to show
* Remove example for now
* Implement desired method
* Resolving Generic Type Arguments
* Resolve Generic Types
* Banging around on the tests
* DANGER! Change unsafe usage based on compiler warnings
* Unscrew up my rebase
* Clean Up Type Spam Default Types FTW
* typo
* better variable names
* Remove Duplicate Levenshtein crate
Closes #273
Ok, so I've started down the path, but I'm for sure lost.