Implementation Notes
==========

Notes on the implementation _before_ it is implemented. Think of it
as something like [readme driven development] [rdd].

Searching
----------

Solr already provides distributed search. However, it is up to the
client, in this case Yokozuna, to decide which _shards_ to run the
query against. The caller specifies the shards and Solr handles the
collating.

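For illustration, a client-driven distributed query passes the shard
list explicitly via Solr's `shards` parameter (the host names and core
paths here are made up):

    q=text:banana&shards=host1:8983/solr/yz1,host2:8983/solr/yz2
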
The shards should be mutually exclusive if you want the results to be
correct. If the same doc id appears in the rows returned then Solr
will remove all but one instance; which instances Solr removes is
non-deterministic. In the case where the duplicates aren't in the
rows returned and the total number of matching rows is greater than
the number returned, the reported `numFound` may be incorrect.

This poses a problem for Yokozuna since it replicates documents. A
document spans multiple shards, thus neighboring shards will have
overlapping document sets. Depending on the number of partitions
(also referred to as _ring size_) and the number of nodes it may be
possible to pick a set of shards which contains the entire set of
documents with no overlap. In most cases, however, overlap cannot be
avoided.

The presence of overlap means that Yokozuna can't simply query a set
of shards. The overlap could cause `numFound` to be wildly off.
Yokozuna could use a Solr Core per index/partition combination, but
this could cause an explosion in the number of Core instances. Also,
more Core instances mean more file descriptor usage and less chance
for Solr to optimize Core usage. A better approach is to filter the
query.

Riak Core contains code to plan and execute _coverage_ queries. The
idea is to calculate a set of partitions which, when combined, covers
the entire set of data. The list of unique nodes, or shards, and the
list of partitions can be obtained from the coverage plan. The
question is how to filter the data in Solr using the partitions
generated by the coverage plan.

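As an aside, the arithmetic behind a covering set can be sketched in a
few lines of Erlang. This is not the riak_core coverage planner, only
an illustration: because each document is written to `N` consecutive
partitions, taking every `N`th partition in a ring of `Q` sees every
document at least once.

    %% Illustration only, not the riak_core planner: choose every Nth
    %% partition index from a ring of Q partitions. Each document lives
    %% on N consecutive partitions, so this set sees every document.
    covering_partitions(Q, N) when Q > 0, N > 0 ->
        [P || P <- lists:seq(0, Q - 1), P rem N =:= 0].

For the defaults `Q=64` and `N=3` this picks 22 partitions; as
discussed below, any such choice covers slightly more than the total
data set because 3 does not evenly divide 64.
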
At write time Yokozuna sends the document to `N` different partitions.
Each partition does a write to its local Solr Core instance. A Solr
_document_ is a set of field-value pairs. Yokozuna can leverage this
fact by adding a partition number field (`_pn`) during the local
write. A document will be replicated `N` times but each replica will
contain a different `_pn` value based on its owning partition. That
takes care of the first half of the problem, getting the partition
data into Solr. Next it must be filtered on at query time.

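A minimal sketch of the write-side tagging, assuming a Solr document is
represented as a list of field-value pairs (the function name and the
representation are made up for illustration):

    %% Add the owning partition to a document's field list. Each of the
    %% N replicas gets the same user fields but a different `_pn` value.
    tag_with_partition(DocFields, Partition) ->
        [{<<"_pn">>, integer_to_binary(Partition)} | DocFields].
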
The most obvious way to filter on `_pn` is to append to the user query.
For example, if the user query is `text:banana` then Yokozuna would
transform it to something like `text:banana AND (_pn:<pn1> OR
_pn:<pn2> ... OR _pn:<pnI>)`. The new query will only accept
documents that have been stored by the specified partitions. This
works, but a more efficient, and perhaps elegant, method is to use
Solr's _filter query_ mechanism.

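Spelled out with arbitrary partition numbers, the rewritten query would
look like this:

    q=text:banana AND (_pn:P2 OR _pn:P5 ... OR _pn:P65)
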
Solr's filter query is like a regular query but it does not affect
scoring and its results are cached. Since a partition can contain
many documents, caching may sound scary. However, the cache value is
a `BitDocSet` which uses a single bit for each document; at 8 bits per
byte, a megabyte of memory can cache over 8 million documents. The
resulting query generated by Yokozuna then looks like the following.

    q=text:banana&fq=_pn:P2 OR _pn:P5 ... OR _pn:P65

It may seem like this is the final solution but there is still one
last problem. Earlier I said that the covering set of partitions
accounts for all the data. This is true, but in most cases it
accounts for a little bit more than all the data. Depending on the
number of partitions (`Q`) and the number of replicas (`N`) there may
be no possible way to select a set of partitions that covers _exactly_
the total set of data. To be precise, if `N` does not evenly divide
into `Q` then the number of overlapping partitions is
`L = N - (Q rem N)`. For the defaults of `Q=64` and `N=3` this means
`L = 3 - (64 rem 3)` or `L=2`.

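The same arithmetic as a one-line check (mirroring the formula above):

    %% Extra overlapping partitions when N does not evenly divide Q.
    overlap(Q, N) when Q rem N =:= 0 -> 0;
    overlap(Q, N) -> N - (Q rem N).
    %% overlap(64, 3) =:= 2
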
82
83 To guarantee that only the total set of unique documents is returned
84 the overlapping partitions must be filtered out. To do this Yokozuna
85 takes the original set of partitions and performs a series of
86 transformations ending with the same list of partitions but with
87 filtering data attached to each. Each partition will have either the
88 value `any` or a list of partitions paired with it. The value
89 indicates which of it's replicas to include based on the first
90 partition that owns it. The value `any` means to include a replica no
91 matter which partition is the first to own it. Otherwise the
92 replica's first owner must be one of the partitions in the include
93 list.
94
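As a concrete, made-up shape for this annotated list (using the `Q=64`,
`N=3` defaults):

    %% Each covering partition is paired with either `any` or the list
    %% of first-owner partitions whose replicas should be counted.
    %% Values here are illustrative, not computed.
    [{2, any}, {5, any}, {8, any}, {60, [60]}, {63, any}]
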
In order to perform this additional filter the first partition number
must be stored as a field in the document. This is the purpose of the
`_fpn` field. Using the final list of partitions, with the filtering
data now added, each `{P, any}` pair can be added as a simple `_pn:P`
to the filter query. However, a `{P, IFPs}` pair must constrain on
the `_fpn` field as well. The `P` and the `IFPs` must be applied
together. If you don't constrain the IFPs to apply only to `P` then
they will apply to the entire query and only a subset of the total
data will be returned. Thus a `{P, [IFP1, IFP2]}` pair will be
converted to `(_pn:P AND (_fpn:IFP1 OR _fpn:IFP2))`. The final
query, achieving 100% accuracy, will look something like the
following.

    q=text:banana&fq=_pn:P2 OR _pn:P5 ... OR (_pn:P60 AND (_fpn:P60)) OR _pn:P63

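The conversion from annotated pairs to a filter query is mechanical;
here is a sketch (the function names and representation are made up):

    %% Sketch: turn the annotated partition list into a Solr fq string.
    to_fq(Pairs) ->
        string:join([clause(P) || P <- Pairs], " OR ").

    clause({P, any}) ->
        "_pn:" ++ integer_to_list(P);
    clause({P, IFPs}) ->
        FPN = string:join(["_fpn:" ++ integer_to_list(I) || I <- IFPs],
                          " OR "),
        "(_pn:" ++ integer_to_list(P) ++ " AND (" ++ FPN ++ "))".

Given the example list above, `to_fq/1` produces
`_pn:2 OR _pn:5 OR _pn:8 OR (_pn:60 AND (_fpn:60)) OR _pn:63`, the same
shape as the `fq` shown.
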
Index Mapping & Cores
----------

* Set `persistent` to `true` in `solr.xml` so that changes during
  runtime will persist on restart.

* Must have an `adminPath` property for the `cores` element or else
  dynamic manipulation will not work.

* Potentially use `adminHandler` and create a custom admin handler for
  Riak integration.

* The core name is the unique index name used in Yokozuna.
  I.e. what Solr calls a core, Yokozuna calls an index. However, there
  is a many-to-one mapping of external names, or aliases, to index
  names.

* According to the Solr wiki, Solr overwrites `solr.xml` when core
  data is changed. In order to protect against corruption Yokozuna
  might want to copy this off somewhere before each core modification.

* The core `CREATE` command only works if the instance dir and config
  are already there. This means that Yokozuna will have to store a
  default setup and copy it over before calling `CREATE`; see the
  sketch after this list.

* An HTTP endpoint `/yz/index/create` will allow the creation of an
  index. Underneath it will call `yz_index:create`.

* There is an implicit mapping from the index name to itself.

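For the `CREATE` call mentioned above, a rough sketch of what the Solr
side might look like, using the CoreAdmin HTTP API via `httpc`. The
URL, port, function name, and the assumption that the instance dir is
already in place are all placeholders:

    %% Sketch: create a Solr core via the CoreAdmin API. Assumes the
    %% default instance dir was already copied into place and Solr is
    %% listening on localhost:8983.
    create_core(Name) ->
        inets:start(),
        Url = "http://localhost:8983/solr/admin/cores?action=CREATE"
              "&name=" ++ Name ++ "&instanceDir=" ++ Name,
        {ok, {{_, 200, _}, _Hdrs, _Body}} =
            httpc:request(get, {Url, []}, [], []).
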
### Integrating with Riak KV

* Ideally, KV should know nothing about Yokozuna. Rather, Yokozuna
  should register a hook with KV and deal with the rest. Yokozuna
  will have knowledge of Riak Object for now. This should probably be
  isolated to a module like `yz_riak_kv` or something.

* Yokozuna is "enabled" on a bucket by first mapping the bucket name
  to an index. Second, the `yz_riak_kv:postcommit` hook must be
  installed on the bucket; a sketch of what that installation might
  look like follows this list.

* Using a postcommit should suffice for prototyping, but tighter
  integration will be needed so that updates may be sent to Yokozuna
  during events such as read-repair. I would still do this in the
  form of registering a callback versus coupling the two directly.

* Yokozuna uses a postcommit because the data should exist before the
  index that references it. This could potentially cause overload
  issues since no indexing back pressure will be provided. There may
  be ways to deal with this in Yokozuna rather than KV, such as lagged
  indexing with an append-only log.

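A sketch of what installing the hook might look like, assuming Riak
KV's convention of listing commit hooks as
`{struct, [{<<"mod">>, ...}, {<<"fun">>, ...}]}` entries in the
bucket's `postcommit` property; the function below is illustrative,
not the final `install_kv_hook/3`:

    %% Illustration: register yz_riak_kv:postcommit/1 on a bucket via
    %% riak_core bucket properties. The real install_kv_hook/3 would
    %% also record the index and schema mapping.
    install_hook(Bucket) ->
        Hook = {struct, [{<<"mod">>, <<"yz_riak_kv">>},
                         {<<"fun">>, <<"postcommit">>}]},
        riak_core_bucket:set_bucket(Bucket, [{postcommit, [Hook]}]).
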
### Module Breakdown

* `yz_riak_kv` - All knowledge specific to Riak KV should reside in
  here to keep it isolated.

* `yz_index` - All functionality regarding indexes, such as mapping
  and administration.

* `yz_solr` - All functionality related to making requests to Solr.
  In regard to indexes, this module should provide functions to
  administer cores.

### API Breakdown

* `PUT /yz/index?name=<name>&initial_schema=<schema_name>` - create a
  new index with `name` (required) based on the `initial_schema` name,
  or the default schema if none is provided.

* `PUT /yz/mapping?alias=<alias>&name=<name>` - create a mapping from
  the `alias` to the index `name`.

* `PUT /yz/kv_hook?bucket=<bucket>&index=<index>&schema=<schema>` -
  install a hook into KV on the `bucket` which maps to `index` and
  uses `schema`. This subsumes the above two, so maybe they aren't
  needed for now.

* `yz_riak_kv:install_kv_hook(Bucket, Index, Schema)` - Same as the
  previous but via Erlang. The HTTP API exists so the user can install
  the hook.

### Use Case Rundown

1. User registers the yz hook for bucket `B` via `PUT /yz/kv_hook?bucket=B&index=B&schema=default`.

2. User writes value `V` under bucket `B` and key `K`.

3. The `put_fsm` is spun up and object `O` is created.

4. The quorum is met and the yz hook is called with object `O`.

5. The index `I` is determined by looking up the index registered for
   bucket `B`.

6. The object `O` is converted to the doc `Doc` for storage in Solr.

7. `N` copies of `Doc` are written across `N` shards.

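Steps 4 through 7 can be summarized as a single hook function; every
helper named here is hypothetical and only illustrates the flow:

    %% Hypothetical postcommit flow: find the registered index for the
    %% object's bucket, convert the object to a Solr doc, and hand it
    %% to the Solr layer, which writes the N replicas.
    postcommit(Obj) ->
        Bucket = riak_object:bucket(Obj),
        Index  = yz_index:get_index(Bucket),   %% hypothetical lookup
        Doc    = yz_doc:make_doc(Obj),         %% hypothetical conversion
        yz_solr:index(Index, Doc).             %% hypothetical write
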
[rdd]: http://tom.preston-werner.com/2010/08/23/readme-driven-development.html