
Updated readme

1 parent 9563c71 commit 6dc5bc67e1a6d64aa2af067b4e9880ffb2acea71 @ofavre ofavre committed Apr 20, 2012
Showing with 250 additions and 8 deletions.
  1. +250 −8 README.md
@@ -3,18 +3,260 @@ HashSplitter analysis plugin for ElasticSearch
HashSplitter plugin is an N-Gram tokenizer generating non-overlapping, prefixed tokens.
-It is aimed at making hashes (or any fixed-length value splittable into equally sized chunks) partially searchable, without using a wildcard query.
-It can also help reduce the index size.
-However, depending on your configuration, if you do not need wildcard searches, you may experience slightly decreased performance.
-See http://elasticsearch-users.115913.n3.nabble.com/Advices-indexing-MD5-or-same-kind-of-data-td2867646.html for more information.
-
-
-In order to install the plugin, simply run: `bin/plugin -install yakaz/elasticsearch-analysis-hashsplitter/master`.
+In order to install the plugin, simply run: `bin/plugin -install yakaz/elasticsearch-analysis-hashsplitter/0.2.0`.
-------------------------------------------------
| HashSplitter Analysis Plugin | ElasticSearch |
-------------------------------------------------
| master | 0.19 -> master |
-------------------------------------------------
+ | 0.2.0 | 0.19 -> master |
+ -------------------------------------------------
+ | 0.1.0 | 0.19 -> master |
+ -------------------------------------------------
+
+It supports a wide variety of requests such as:
+
+- exact match
+- query by analyzed (prefixed) terms
+- wildcard query
+- range query
+- prefix query
+
+Here's a concrete example of the analysis performed:
+
+    chunk_length: 4
+    prefixes: ABCDEFGH
+    input: d41d8cd98f00b204e9800998ecf8427e
+    output:
+    - Ad41d
+    - B8cd9
+    - C8f00
+    - Db204
+    - Ee980
+    - F0998
+    - Gecf8
+    - H427e
+
+It is aimed at making hashes (or any fixed-length value splittable into equally sized chunks) partially searchable in an efficient way, without having a plain wildcard query enumerate tons of terms.
+It can also help reduce the index size.
+
+However, depending on your configuration, if you do not need wildcard searches, you may experience slightly decreased performance.
+See http://elasticsearch-users.115913.n3.nabble.com/Advices-indexing-MD5-or-same-kind-of-data-td2867646.html for more information.
+
+
+Features
+--------
+
+The plugin provides:
+
+- **`hashsplitter` field type**
+- `hashsplitter` analyzer
+- `hashsplitter` tokenizer
+- `hashsplitter` token filter
+- `hashsplitter_term` query/filter (same syntax as the regular `term` query/filter)
+- `hashsplitter_wildcard` query/filter (same syntax as the regular `wildcard` query/filter)
+
+The plugin also provides correct support of the `hashsplitter` field type for the standard:
+
+- field query/filter (used by the `term` query/filter)
+- prefix query/filter
+- range query/filter
+
+The plugin does *not* support:
+
+- fuzzy query/filter
+
+The plugin _cannot_ currently support (as of ElasticSearch 0.19.0):
+
+- term query/filter: This gets mapped to a field query by ElasticSearch. Use the `hashsplitter_term` query instead.
+
+Note that a `query_string` query automatically uses the field, prefix, range and fuzzy capabilities of the `hashsplitter` field.
+But make sure you actually use the `hashsplitter` field type and direct the query to that field (and not to the `_all` field, for example).
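+
+For example, a `query_string` query targeting such a field directly could look like this (a sketch; `your_hash_field` stands for a field mapped with the `hashsplitter` type, as shown in the Configuration section below):
+
+    {
+        "query_string" : {
+            "default_field" : "your_hash_field",
+            "query" : "d41d8cd98f00b204e9800998ecf8427e"
+        }
+    }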
+
+
+Configuration
+-------------
+
+It is recommended that you use the `hashsplitter` field type, as it makes custom querying easy.
+It is also the only way of using the field, prefix and range queries/filters.
+The alternative would be to use the `hashsplitter` analysis on the field, and to pay extra attention to the way you query the field.
+
+### The `hashsplitter` field type ###
+
+Here is a sample mapping (in `config/mapping/your_index/your_mapping_type.json`):
+
+    {
+        "your_mapping_type" : {
+            "properties" : {
+                [...]
+                "your_hash_field" : {
+                    "type" : "hashsplitter",
+                    "settings" : {
+                        "chunk_length" : 4,
+                        "prefixes" : "abcd",
+                        "size" : 16,
+                        "wildcard_one" : "?",
+                        "wildcard_any" : "*"
+                    }
+                },
+                [...]
+            }
+        }
+    }
+
+This will define the `your_hash_field` field within the `your_mapping_type` as having the `hashsplitter` type.
+Notice the unusual `settings` section. It will be parsed by the plugin in order to configure the tokenization according to your needs.
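+
+The same mapping can also be registered at runtime through the standard Put Mapping API instead of a configuration file; a minimal sketch, assuming a hypothetical existing index named `your_index`:
+
+    curl -XPUT 'http://localhost:9200/your_index/your_mapping_type/_mapping' -d '{
+        "your_mapping_type" : {
+            "properties" : {
+                "your_hash_field" : {
+                    "type" : "hashsplitter",
+                    "settings" : {
+                        "chunk_length" : 4,
+                        "prefixes" : "abcd",
+                        "size" : 16
+                    }
+                }
+            }
+        }
+    }'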
+
+#### Parameters: ####
+
+- `chunk_length`: The length of the chunks generated by the analysis.
+ The input "0123456789" with a `chunk_length` of 2 will be split into `[01, 23, 45, 67, 89]`,
+ with a `chunk_length` of 3 it will be split into `[012, 345, 678, 9]`.
+ Note that the last chunk can be shorter than `chunk_length` characters.
+- `prefixes`: The positional prefixes to prepend to each chunk.
+ Each individual character in the given string will be used, in turn.
+ The chunks `[000, 111, 222, 333]` with `prefixes` set to `"abc"` will generate the following terms: `[a000, b111, c222, a333]`.
+ Note how it wraps around when there are not enough prefix characters available.
+ You want to avoid this, as it will make `a000` and `a333` indistinguishable.
+- `size`: How long the input hashes are supposed to be, as an integer, or `"variable"`.
+ Note that this does not validate input values in any way.
+ This information is solely used by the wildcard query/filter in order to expand `*`s properly.
+- `wildcard_one`: The character to be used as the _single character wildcard_. A single character string.
+ This may help you if the default `?` is a genuine input character.
+ It is solely used in the wildcard query/filter.
+- `wildcard_any`: The character to be used as the _any string wildcard_. A single character string.
+ This may help you if the default `*` is a genuine input character.
+ It is solely used in the wildcard query/filter.
+
+##### Default values: #####
+
+All parameters are optional, as is the `settings` section itself.
+
+- `chunk_length`: 1
+- `prefixes`: `"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,."`
+- `size`: `"variable"`
+- `wildcard_one`: `"?"`
+- `wildcard_any`: `"*"`
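+
+For instance, to index 32-character MD5 hashes in 4-character chunks without any prefix wrapping, you need at least 32 / 4 = 8 distinct prefix characters; a possible `settings` section (the values are only an illustration):
+
+    "settings" : {
+        "chunk_length" : 4,
+        "prefixes" : "ABCDEFGH",
+        "size" : 32
+    }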
+
+
+### The `hashsplitter` analyzer, tokenizer and token filter ###
+
+Those analysis components merely split the input into fixed-size chunks and prefix them.
+Each of them has 2 parameters that you will want to define in the configuration.
+
+Here is a sample configuration (in `config/elasticsearch.yml`):
+
+    index.analysis:
+      analyzer:
+        your_hash_analyzer:
+          type: hashsplitter
+          chunk_length: 4
+          prefixes: ABCDEFGH
+      tokenizer:
+        your_hash_tokenizer:
+          type: hashsplitter
+          chunk_length: 4
+          prefixes: ABCDEFGH
+      filter:
+        your_hash_tokenfilter:
+          type: hashsplitter
+          chunk_length: 4
+          prefixes: ABCDEFGH
+
+This will configure an analyzer, a tokenizer and a token filter (all three being separate components).
+You can then create your own custom analyzer using the newly configured tokenizer and/or token filter.
+Note that _that_ **custom** analyzer will have `type: custom`.
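+
+Such a custom analyzer could look like this (a sketch reusing the tokenizer and token filter configured above; the analyzer names are hypothetical):
+
+    index.analysis:
+      analyzer:
+        your_custom_hash_analyzer:
+          type: custom
+          tokenizer: your_hash_tokenizer
+        another_custom_hash_analyzer:
+          type: custom
+          tokenizer: keyword
+          filter: [your_hash_tokenfilter]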
+
+#### Parameters: ####
+
+- `chunk_length`
+- `prefixes`
+
+See `hashsplitter` field type parameters for more information.
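+
+You can check the resulting tokens with the standard Analyze API; a sketch, assuming an index named `your_index` configured as above:
+
+    curl -XGET 'http://localhost:9200/your_index/_analyze?analyzer=your_hash_analyzer' -d 'd41d8cd98f00b204e9800998ecf8427e'
+
+This should return the prefixed tokens `Ad41d`, `B8cd9`, ..., `H427e` from the concrete example at the top of this document.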
+
+
+Usage: querying
+---------------
+
+### Term query ###
+
+    {
+        "term" : {
+            "your_hash_field" : "d41d8cd98f00b204e9800998ecf8427e"
+        }
+    }
+
+**Note**: The length is not checked.
+However, if your field values are always of the same fixed length and your query value is of that same length too, then you're safe.
+
+You will need to understand how this query works in order to make sense of this warning.
+The same analysis is performed when indexing the field and when processing this query. The searched value gets split into terms, which are merely AND-ed together. Hence, any additional terms (from a longer field value) won't prevent the match.
+However, if the last term chunk is not of the correct size, no match will occur! (e.g. `"d41d8"` would generate the query `+Ad41d +B8`, and `B8` will never match.)
+
+**Positive side-effect**: If the field length is not a multiple of the chunk length, then the match will only include same-length hashes, as a longer hash would have a longer (hence different, non-matching) last term.
+
+Also note that, contrary to what the documentation states for the default `term` query, the provided term **is** analyzed here, hence the provided value gets chunked, prefixed and AND-ed. If you need to match a single generated term verbatim, use the `hashsplitter_term` query described below.
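+
+A complete search request using this query might look as follows (a sketch; the index name is a placeholder):
+
+    curl -XGET 'http://localhost:9200/your_index/_search' -d '{
+        "query" : {
+            "term" : {
+                "your_hash_field" : "d41d8cd98f00b204e9800998ecf8427e"
+            }
+        }
+    }'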
+
+### Chunk query ###
+
+    {
+        "hashsplitter_term" : {
+            "your_hash_field" : "H427e"
+        }
+    }
+
+This query allows you to match the generated terms exactly. No analysis is performed: a pure `TermQuery` is generated with the given field and term.
+This query is the only way to specify yourself the prefix along with the chunk value to be queried.
+
+Remember that the default `term` query behaves differently: as noted above, the provided term **is** analyzed there, hence the provided value gets chunked and prefixed, and the pieces are AND-ed together.
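+
+Since each chunk carries a positional prefix, several such queries can be combined to constrain arbitrary positions; a sketch, reusing the `chunk_length: 4` / `prefixes: ABCDEFGH` example from above:
+
+    {
+        "bool" : {
+            "must" : [
+                { "hashsplitter_term" : { "your_hash_field" : "Ad41d" } },
+                { "hashsplitter_term" : { "your_hash_field" : "H427e" } }
+            ]
+        }
+    }
+
+This would match hashes whose first chunk is `d41d` and whose eighth chunk is `427e`, whatever lies in between.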
+
+### Prefix query ###
+
+    {
+        "prefix" : {
+            "your_hash_field" : "d41d8"
+        }
+    }
+
+Assuming `chunk_length` = 4, this will generate the query `+Ad41d +PREFIX`, where `PREFIX` is the prefix query `B8*`, filtered to only return terms whose size is between 2 and 5 (or equal to the remaining size, if the size is fixed).
+
+### Range query ###
+
+    {
+        "range" : {
+            "your_hash_field" : {
+                "from" : "d4000000000000000000000000000000",
+                "include_lower" : true,
+                "to" : "d4200000000000000000000000000000",
+                "include_upper" : false
+            }
+        }
+    }
+
+The generated range queries are optimized to only query the terms at the _required_ level, like Lucene's NumericRangeQuery does. (With the difference that in Lucene the whole term up to the cut level is included, whereas we only include a middle chunk without the previous ones.)
+
+The lexicographical ordering of terms is used. The prefixes used won't have any influence, but the length of the terms will. For instance the range `[d400 TO d42]` (both inclusive) will match `d400 0000 0000 ...` but not `d420 0000 0000 ...` (spaces added to visualise the generated chunks), because `d42` sorts before `d420`, hence the latter is not included within the range.
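+
+Expressed as a query, the range from that example would read:
+
+    {
+        "range" : {
+            "your_hash_field" : {
+                "from" : "d400",
+                "include_lower" : true,
+                "to" : "d42",
+                "include_upper" : true
+            }
+        }
+    }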
+
+### Wildcard query ###
+
+    {
+        "hashsplitter_wildcard" : {
+            "your_hash_field" : "d41?8*27e"
+        }
+    }
+
+Note: The `?` and `*` wildcards must match the ones configured in the field type mapping (these are the default values).
+
+The `*` wildcard is restricted to **one usage** per pattern, and text may appear after it _if and only if_ the field type mapping
+uses a fixed size. Using `*` at the end is always possible and equates to a prefix query with optional `?` wildcards.
+
+This restriction arises from the fact that prefixes are used to “locate” chunks, hence all characters in the pattern must be located precisely. Using more than one `*` makes it impossible to perform this localisation deterministically. A simple fallback will however be used: the particular case where all `*`s match a zero-length string. But this is likely to be of no help...
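+
+To summarize, here are a few illustrative patterns and whether they can be handled (assuming the default wildcard characters, and a fixed `size` where noted):
+
+    d41d8*          # always accepted: equates to a prefix query
+    d41?8*          # always accepted: prefix query with a ? wildcard
+    d41?8*27e       # accepted only if the mapping declares a fixed size
+    *427e           # accepted only if the mapping declares a fixed size
+    d4*8f*7e        # two *s: only the zero-length fallback applies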
+
+### Sophisticated queries ###
+
+As long as you query against `your_hash_field` (the field of type `hashsplitter`), the generated queries should behave like the ones described above.
+Sophisticated queries can often create several of the above queries, as they use complex lexical analysis to express combinations of multiple queries (e.g. `"+ANDed_token ORed_token -NOT_token [from_token TO to_token]"`).
-The plugin includes the `hash_splitter` tokenizer and token filter.
+Note that the default `wildcard` query won't function in the intended way. Don't use it, not even through sophisticated queries with `analyze_wildcard`.
