Implementation Notes
==========

Notes on the implementation _before_ it is implemented. Think of it
as something like [readme driven development] [rdd].


Indexing
----------

### Avoid Post-Commit Hook

* The object must be sent `2 * N` times: `N` times for the KV write
  and another `N` times for indexing. In the worst case all of those
  messages have to traverse the network, and in the best case `2N - 2`
  of them do. With `N=3` and a 5MB object--the block size in
  RiakCS--that means 30MB of data must traverse the network, 15MB of
  which is redundant.

* Post-commit is executed on the coordinator after the object is
  written and a client reply is sent. This provides no back-pressure
  to the client.

* Post-commit error handling is wrong. It hides errors and just
  increments a counter kept by stats. You must alter the logging
  levels at runtime to discover the cause of the errors.

* Post-commit is only invoked during user put requests. Indexing
  changes also need to occur during read-repair and handoff events.
  Any time the KV object changes, the index needs to change as well
  (at minimum the object hash must be updated).

### Add Event Hooks to VNodes

* Gives Yokozuna access to all object modifications.

* Exploits locality, avoiding redundant transmission of the object
  across the network.

* Provides back-pressure during client writes.

* Could set the stage for atomic commits between KV and other
  services if that's something we wanted to pursue.

* A downside is that more is now happening on the KV vnode, which is
  a high-contention point as it is. Measurement and careful thought
  are needed here.

### Ideas for Implementation

* I'm not sure if this is a generic vnode thing or specific to the KV
  vnode. Right now I'm leaning towards the latter.

* The events Yokozuna needs to react to: put, read-repair (which is
  ultimately a put), and handoff (which once again is just a put).
  Maybe all I need is a low-level hook into the KV backend put. It
  might help to think of Yokozuna as a backend to KV that complements
  the primary backend. Using a low-level put hook covers all cases
  since it is invoked any time the object is modified. It also
  provides some future proofing as it should always be the least
  common denominator for any future object mutation (e.g. if some new
  type of event was added to KV that causes the object to be
  modified).

* Invoke the hook IFF the object has changed and the write to the
  backend was successful (see the sketch after this list). Have to
  look at `PrepPutRes` from `prepare_put` and `Reply` from
  `perform_put`.

* The deletion of an object from a backend is a separate code path.
  Need to hook into that as well.

* The handoff object put is a different code path, see
  `do_diffobj_put`. Need to hook into this.

* Yokozuna handoff piggy-backs KV handoff (by re-indexing on put
  versus sending across Solr index data) and therefore Yokozuna vnode
  handoff is a simple matter of dropping the index. Actually, this is
  a lie. If the KV vnode doesn't hand off first then an entire
  partition of replicas is lost temporarily. The Yokozuna vnode needs
  a way to tell the handoff system that it cannot start handoff until
  the KV service for the same partition performs handoff. This could
  be done by returning `{waiting_for, [riak_kv]}`. The vnode manager
  will probably have to be modified.

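As a rough sketch, the low-level put hook could look something like
the following. None of these names are settled API; `yz_doc:make_doc/1`
and `yz_solr:index/1` are placeholders.

```erlang
%% Sketch only -- assume the KV vnode invokes this after
%% prepare_put/perform_put, passing whether the object changed and the
%% result of the backend write.
-module(yz_kv_hook).
-export([postput/3]).

postput(Obj, true, ok) ->
    %% Object changed and the backend write succeeded: index it.
    yz_solr:index([yz_doc:make_doc(Obj)]);
postput(_Obj, _Changed, _PutReply) ->
    %% No change, or the backend write failed: nothing to do.
    ok.
```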
Searching
----------

Solr already provides distributed search. However, it is up to the
client, in this case Yokozuna, to decide which _shards_ to run the
query against. The caller specifies the shards and Solr handles the
collating.

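For example, a plain distributed query names its shards explicitly
(the host names here are only illustrative):

    q=text:banana&shards=host1:8983/solr,host2:8983/solr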
The shards should be mutually exclusive if you want the results to be
correct. If the same doc id appears in the rows returned then Solr
will remove all but one instance. Which instances Solr removes is
non-deterministic. In the case where the duplicates aren't in the
rows returned and the total number of matching rows is greater than
those returned, `numFound` may be incorrect.

This poses a problem for Yokozuna since it replicates documents. A
document spans multiple shards, thus neighboring shards will have
overlapping document sets. Depending on the number of partitions
(also referred to as _ring size_) and number of nodes it may be
possible to pick a set of shards which contain the entire set of
documents with no overlap. In most cases, however, overlap cannot be
avoided.

The presence of overlap means that Yokozuna can't simply query a set
of shards. The overlapping could cause `numFound` to be wildly off.
Yokozuna could use a Solr Core per index/partition combination but
this could cause an explosion in the number of Core instances. Also,
more Core instances mean more file descriptor usage and less chance
for Solr to optimize Core usage. A better approach is to filter the
query.

Riak Core contains code to plan and execute _coverage_ queries. The
idea is to calculate a set of partitions which, when combined, covers
the entire set of data. The list of unique nodes, or shards, and the
list of partitions can be obtained from coverage. The question is how
to filter the data in Solr using the partitions generated by the
coverage plan.

At write time Yokozuna sends the document to `N` different partitions.
Each partition does a write to its local Solr Core instance. A Solr
_document_ is a set of field-value pairs. Yokozuna can leverage this
fact by adding a partition number field (`_pn`) during the local
write. A document will be replicated `N` times but each replica will
contain a different `_pn` value based on its owning partition. That
takes care of the first half of the problem, getting the partition
data in Solr. Next it must be filtered on.

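A minimal sketch of that write-time tagging, assuming a field-list
document representation and a `yz_solr:index/2` helper (both
illustrative):

```erlang
%% Sketch only: tag the replica with its owning partition before the
%% local Solr write.
-module(yz_pn_sketch).
-export([index_local/2]).

index_local(DocFields, Partition) when is_integer(Partition) ->
    Doc = [{'_pn', integer_to_list(Partition)} | DocFields],
    yz_solr:index(Partition, [Doc]).
```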
The most obvious way to filter on `_pn` is to append to the user query.
For example, if the user query is `text:banana` then Yokozuna would
transform it to something like `text:banana AND (_pn:<pn1> OR
_pn:<pn2> ... OR _pn:<pnI>)`. The new query will only accept
documents that have been stored by the specified partitions. This
works, but a more efficient, and perhaps more elegant, method is to
use Solr's _filter query_ mechanism.

Solr's filter query is like a regular query but it does not affect
scoring and its results are cached. Since a partition can contain
many documents, caching may sound scary. However, the cache value is
a `BitDocSet`, which uses a single bit for each document. That means
a megabyte of memory can cache over 8 million documents. The
resulting query generated by Yokozuna then looks like the following.

    q=text:banana&fq=_pn:P2 OR _pn:P5 ... OR _pn:P65

It may seem like this is the final solution, but there is still one
last problem. Earlier I said that the covering set of partitions
accounts for all the data. This is true, but in most cases it
accounts for a little bit more than all the data. Depending on the
number of partitions (`Q`) and the number of replicas (`N`) there may
be no possible way to select a set of partitions that covers _exactly_
the total set of data. To be precise, if `N` does not evenly divide
`Q` then the number of overlapping partitions is `L = N - (Q rem N)`.
For the defaults of `Q=64` and `N=3` this means `L = 3 - (64 rem 3)`
or `L=2`.

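A trivial helper, just to make the arithmetic concrete (when `N`
evenly divides `Q` the cover is exact, so the overlap is taken to be
zero):

```erlang
%% Number of overlapping partitions, per the formula above.
%% overlap(64, 3) =:= 2.
-module(yz_overlap_sketch).
-export([overlap/2]).

overlap(Q, N) when Q rem N =:= 0 -> 0;
overlap(Q, N) -> N - (Q rem N).
```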
To guarantee that only the total set of unique documents is returned,
the overlapping partitions must be filtered out. To do this Yokozuna
takes the original set of partitions and performs a series of
transformations ending with the same list of partitions but with
filtering data attached to each. Each partition will have either the
value `any` or a list of partitions paired with it. The value
indicates which of its replicas to include based on the first
partition that owns it. The value `any` means to include a replica no
matter which partition is the first to own it. Otherwise the
replica's first owner must be one of the partitions in the include
list.

In order to perform this additional filter the first partition number
must be stored as a field in the document. This is the purpose of the
`_fpn` field. Using the final list of partitions, with the filtering
data now added, each `{P, any}` pair can be added as a simple `_pn:P`
to the filter query. However, a `{P, IFPs}` pair must also constrain
the `_fpn` field. The P and IFPs must be applied together. If you
don't constrain the IFPs to only apply to P then they will apply to
the entire query and only a subset of the total data will be
returned. Thus a `{P, [IFP1, IFP2]}` pair will be converted to
`(_pn:P AND (_fpn:IFP1 OR _fpn:IFP2))`. The final query, achieving
100% accuracy, will look something like the following.

    q=text:banana&fq=_pn:P2 OR _pn:P5 ... OR (_pn:P60 AND (_fpn:60)) OR _pn:P63

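A sketch of that filter-query construction (module and function names
are illustrative; real partition values are the large ring partition
numbers rather than the small placeholders used here):

```erlang
%% Sketch: turn the coverage filter pairs into a Solr fq string. For
%% example, build_fq([{2, any}, {5, any}, {60, [60]}, {63, any}])
%% produces "_pn:2 OR _pn:5 OR (_pn:60 AND (_fpn:60)) OR _pn:63".
-module(yz_fq_sketch).
-export([build_fq/1]).

build_fq(Filter) ->
    string:join([clause(PN, Inc) || {PN, Inc} <- Filter], " OR ").

clause(PN, any) ->
    "_pn:" ++ integer_to_list(PN);
clause(PN, IFPs) when is_list(IFPs) ->
    FPNs = ["_fpn:" ++ integer_to_list(I) || I <- IFPs],
    "(_pn:" ++ integer_to_list(PN) ++
        " AND (" ++ string:join(FPNs, " OR ") ++ "))".
```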
Index Mapping & Cores
----------

Solr has the notion of a [core] [solr_core] which allows multiple
indexes to live under the same Solr/JVM instance. This is useful
because it allows isolation of index files as well as schemas and
configuration. Yokozuna exposes the notion of cores as _indexes_.
Each index has a unique name and maps to **one** core.

* Set `persistent` to `true` in `solr.xml` so that changes during
  runtime will persist on restart.

* Must have an `adminPath` property for the `cores` element or else
  dynamic manipulation will not work.

* Potentially use `adminHandler` and create a custom admin handler
  for Riak integration.

* The core name is the unique index name used in Yokozuna.
  I.e. Yokozuna calls an index what Solr calls a core. However, there
  is a many-to-one mapping of external names, or aliases, to index
  names.

* According to the Solr wiki, Solr overwrites `solr.xml` when core
  data is changed. In order to protect against corruption Yokozuna
  might want to copy it off somewhere before each core modification.

* The core `CREATE` command only works if the instance dir and config
  are already there. This means that Yokozuna will have to store a
  default setup and copy it over before calling `CREATE` (see the
  sketch after this list).

* An HTTP endpoint `yz/index/create` will allow the creation of an
  index. Underneath it will call `yz_index:create`.

* There is an implicit mapping from the index name to itself.

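A sketch of how `yz_index:create` might drive Solr's CoreAdmin API,
assuming the default setup has already been copied into place; the
host/port and the `copy_default_setup/1` helper are placeholders:

```erlang
%% Sketch only. Requires inets/httpc to be started.
-module(yz_index_sketch).
-export([create/1]).

-define(SOLR, "http://localhost:8983/solr").

create(Name) ->
    ok = copy_default_setup(Name),
    URL = ?SOLR ++ "/admin/cores?action=CREATE&name=" ++ Name
        ++ "&instanceDir=" ++ Name,
    {ok, {{_, 200, _}, _Headers, _Body}} = httpc:request(URL),
    ok.

copy_default_setup(_Name) ->
    %% Placeholder: copy the stored default instance dir and config
    %% into place before calling CREATE, as noted above.
    ok.
```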
### Integrating with Riak KV

* Ideally, KV should know nothing about Yokozuna. Rather, Yokozuna
  should register a hook with KV and deal with the rest. Yokozuna
  will have knowledge of Riak Object for now. This should probably be
  isolated to a module like `yz_riak_kv` or something.

* Yokozuna is "enabled" on a bucket by first mapping a bucket name to
  an index. Second, the `yz_riak_kv:postcommit` hook must be
  installed on the bucket.

* Using a postcommit should suffice for prototyping but tighter
  integration will be needed so that updates may be sent to Yokozuna
  during events such as read-repair. I would still do this in the
  form of registering a callback versus coupling the two directly.

* Yokozuna uses a postcommit because the data should exist before the
  index that references it. This could potentially cause overload
  issues since no indexing back-pressure will be provided. There may
  be ways to deal with this in Yokozuna rather than KV, such as
  lagged indexing with an append-only log.

### Module Breakdown

* `yz_riak_kv` - All knowledge specific to Riak KV should reside here
  to keep it isolated.

* `yz_index` - All functionality related to indexes, such as mapping
  and administration.

* `yz_solr` - All functionality related to making requests to Solr.
  In regard to indexes this module should provide functions to
  administer cores.

### API Breakdown

* `PUT /yz/index?name=<name>&initial_schema=<schema_name>` - create a
  new index with `name` (required) based on the `initial_schema` name
  or the default schema if none is provided.

* `PUT /yz/mapping?alias=<alias>&name=<name>` - create a mapping from
  the `alias` to the index `name`.

* `PUT /yz/kv_hook?bucket=<bucket>&index=<index>&schema=<schema>` -
  install a hook into KV on the `bucket` which maps to `index` and
  uses `schema`. This subsumes the above two so maybe they aren't
  needed for now.

* `yz_riak_kv:install_kv_hook(Bucket, Index, Schema)` - Same as the
  previous but via Erlang. The HTTP API exists so the user can
  install the hook (see the sketch after this list).

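A sketch of what `yz_riak_kv:install_kv_hook/3` might do. The
postcommit hook property shape and the `yz_index:set_mapping/3`
helper are assumptions, shown only to make the idea concrete:

```erlang
%% Sketch only: record the bucket-to-index mapping and install the
%% postcommit hook as a bucket property.
-module(yz_riak_kv_sketch).
-export([install_kv_hook/3]).

install_kv_hook(Bucket, Index, Schema) ->
    ok = yz_index:set_mapping(Bucket, Index, Schema),
    Hook = {struct, [{<<"mod">>, <<"yz_riak_kv">>},
                     {<<"fun">>, <<"postcommit">>}]},
    riak_core_bucket:set_bucket(Bucket, [{postcommit, [Hook]}]).
```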
### Use Case Rundown

1. User registers the yz hook for bucket `B` via `PUT /yz/kv_hook?bucket=B&index=B&schema=default`.

2. User writes value `V` under bucket `B` and key `K`.

3. The `put_fsm` is spun up and object `O` is created.

4. The quorum is met and the yz hook (sketched below) is called with
   object `O`.

5. The index `I` is determined by looking up the index registered for
   bucket `B`.

6. The object `O` is converted to the doc `Doc` for storage in Solr.

7. `N` copies of `Doc` are written across `N` shards.

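A sketch of that hook, tying the rundown together; the lookup and
conversion helpers (`yz_index:get_index/1`, `yz_doc:make_doc/1`) are
placeholders:

```erlang
%% Sketch only: the postcommit hook from the rundown above.
-module(yz_riak_kv_hook_sketch).
-export([postcommit/1]).

postcommit(RObj) ->
    Bucket = riak_object:bucket(RObj),
    Index = yz_index:get_index(Bucket),
    Doc = yz_doc:make_doc(RObj),
    yz_solr:index(Index, [Doc]).
```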
[rdd]: http://tom.preston-werner.com/2010/08/23/readme-driven-development.html

[solr_core]: http://wiki.apache.org/solr/CoreAdmin