<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<title>Embracing Diversity: Searching over multiple languages</title>
<meta name="description" content="Embracing Diversity: Searching over multiple languages">
<meta name="author" content="Tommaso Teofili; Suneel Marthi">
<meta name="apple-mobile-web-app-capable" content="yes"/>
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="font-awesome-4.7.0/css/font-awesome.min.css">
<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="css/theme/bbuzz17.css">
<!-- Theme used for syntax highlighting of code -->
<link rel="stylesheet" href="lib/css/zenburn.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? 'css/print/pdf.css' : 'css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<!--[if lt IE 9]> <script src="lib/js/html5shiv.js"></script> <![endif]-->
</head>
<body>
<div class="reveal">
<div class="slides">
<section data-background-image="img/buzzwords_2017.png">
<br/>
<br/>
<br/>
<h3>Embracing Diversity: Searching over multiple languages</h3>
<br/>
<p>Tommaso Teofili<br/>
Suneel Marthi
</p>
<p style="font-size: 60%">June 12, 2017<br/>
Berlin Buzzwords, Berlin, Germany</p>
</section>
<section data-background-image="img/ZsMwlm2.jpg">
<h3>$WhoAreWe</h3>
<h6 style='text-align: left; font-size: 80%;'>Tommaso Teofili <br/><span style="font-size: 60%"><i class="fa fa-twitter" aria-hidden="true"></i> @tteofili</span></h6>
<ul style='font-size: 70%;'>
<li>Software Engineer, Adobe Systems</li>
<li>Member of the Apache Software Foundation</li>
<li>PMC Chair, Apache Lucene</li>
<li>Committer and PMC member on Apache Joshua, Apache OpenNLP, Apache Jackrabbit</li>
</ul>
<br/>
<br/>
<h6 style='text-align: left; font-size: 80%;'>Suneel Marthi <br/><span style="font-size: 60%"><i class="fa fa-twitter" aria-hidden="true"></i> @suneelmarthi</span></h6>
<ul style='font-size: 70%;'>
<li>Principal Software Engineer, Office of Technology, Red Hat</li>
<li>Member of the Apache Software Foundation</li>
<li>Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache Streams</li>
</ul>
</section>
<section data-background-image="img/ZsMwlm2.jpg">
<h3>Agenda</h3>
<ul>
<li>What is Multi-Lingual Search?</li>
<li>Why Multi-Lingual Search?</li>
<li>What is Statistical Machine Translation?</li>
<li>Overview of Apache Joshua</li>
<li>Dataflow Pipeline</li>
<li>Demo</li>
</ul>
</section>
<section data-background-image="img/ZsMwlm2-blueish.jpg">
<h3>What is Multi-Lingual Search?</h3>
</section>
<section data-background-image="img/ZsMwlm2-blueish.jpg">
<ul>
<li>Searching</li>
<ul>
<li>over content written in different languages</li>
<li>with users speaking different languages</li>
<li>both</li>
</ul>
<li>Parallel corpora</li>
<li>Translating queries</li>
<li>Translating documents</li>
</ul>
</section>
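<section data-markdown data-background-image="img/ZsMwlm2-blueish.jpg">
<script type="text/template">
A minimal sketch of the "translating queries" option: translate the query into the language of the index, then search as usual. The documents, the `toy_dictionary` lookup and both function names are hypothetical stand-ins; a real setup would call an MT decoder and a search engine instead.

```python
# Tiny English-only "index"; the documents are invented for illustration.
documents = {
    1: "how to build a search engine",
    2: "machine translation with phrase tables",
}

# Hypothetical German-to-English lookup standing in for a real MT system.
toy_dictionary = {"suchmaschine": "search engine", "übersetzung": "translation"}

def translate_query(query):
    """Translate token by token; unknown tokens pass through unchanged."""
    return " ".join(toy_dictionary.get(t, t) for t in query.lower().split())

def search(query):
    """Return ids of documents sharing at least one term with the query."""
    terms = set(query.split())
    return sorted(doc_id for doc_id, text in documents.items()
                  if terms & set(text.split()))

hits = search(translate_query("Suchmaschine"))
print(hits)
```

Translating documents instead would move the same translation step to indexing time.
</script>
</section>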
<section data-background-image="img/ZsMwlm2-redish.jpg">
<h3>Why Multi-Lingual Search?</h3>
</section>
<section data-background-image="img/ZsMwlm2-redish.jpg">
<h3>Embracing diversity</h3>
<ul>
<li>Most online tech content is in English</li>
<ul>
<li>Wikipedia dumps:</li>
<ul>
<li>en: 62GB</li>
<li>de: 17GB</li>
<li>it: 10GB</li>
</ul>
</ul>
<li>Many users are not native English speakers</li>
<li>Many search queries are still composed in English</li>
<li>It is preferable to retrieve search results in the user's native language</li>
<li>… or even to consolidate all results in one language</li>
</ul>
</section>
<section data-background-image="img/ZsMwlm2-redish.jpg">
<h3>UC1 — tech domain, native first</h3>
<img src='img/uc1.png' style='max-width: 75%; border: none; background: none;box-shadow: none; '/>
</section>
<section data-background-image="img/ZsMwlm2-redish.jpg">
<h3>UC2 — native only ?</h3>
<img src='img/uc2.png' style='border: none; background: none;box-shadow: none; '/>
</section>
<section data-background-image="img/ZsMwlm2-purpleish.jpg">
<h3>What is Statistical Machine Translation?</h3>
</section>
<section data-background-image="img/ZsMwlm2-purpleish.jpg">
<p style='text-align: left'>Generate translations from statistical models trained on bilingual corpora.</p>
<p style='text-align: left'>Translation is modeled by a probability distribution <code>p(e|f)</code></p>
<pre><code data-trim data-noescape>e = string in the target language (English)
f = string in the source language (Spanish)
e~ = argmax p(e|f) = argmax p(f|e) * p(e)    (Bayes' rule; p(f) is constant)
e~ = best translation, the one with the highest probability</code></pre>
</section>
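<section data-markdown data-background-image="img/ZsMwlm2-purpleish.jpg">
<script type="text/template">
A toy sketch of the decision rule `argmax p(f|e) * p(e)`. The candidate sentences and every probability are invented for illustration; they do not come from a trained model.

```python
# Hypothetical translation model p(f|e) for one fixed Spanish input f.
translation_model = {
    "the house is small": 0.20,
    "small the house is": 0.20,
    "the home is little": 0.10,
}

# Hypothetical language model p(e): fluent English scores higher.
language_model = {
    "the house is small": 0.010,
    "small the house is": 0.0001,
    "the home is little": 0.002,
}

def best_translation(tm, lm):
    """Return the candidate e with the highest p(f|e) * p(e)."""
    return max(tm, key=lambda e: tm[e] * lm[e])

e_best = best_translation(translation_model, language_model)
print(e_best)
```

The first two candidates tie on the translation model; the language model breaks the tie in favor of the fluent word order.
</script>
</section>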
<section data-background-image="img/ZsMwlm2-greenish.jpg">
<h3>Word-based Translation</h3>
</section>
<section data-background-image="img/ZsMwlm2-greenish.jpg">
<dl>
<dt>How to translate a word → look it up in a dictionary</dt>
<dd>Gebäude — building, house, tower.</dd>
<br/>
<dt>Multiple translations</dt>
<dd>some are more frequent than others<br/>
for instance, house and building are the most common</dd>
</dl>
</section>
<section data-background-image="img/ZsMwlm2-greenish.jpg">
<p>Look at a parallel corpus<br/>(German text along with English translation)</p>
<table>
<thead>
<tr>
<th>Translation of Gebäude</th><th>Count</th><th>Probability</th>
</tr>
</thead>
<tbody>
<tr>
<td>house</td><td>5.28 billion</td><td>0.51</td>
</tr>
<tr>
<td>building</td><td>4.16 billion</td><td>0.40</td>
</tr>
<tr>
<td>tower</td><td>928 million</td><td>0.09</td>
</tr>
</tbody>
</table>
</section>
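<section data-markdown data-background-image="img/ZsMwlm2-greenish.jpg">
<script type="text/template">
A sketch of how the probability column can be derived from the counts by relative frequency. It assumes the count for "tower" is 928 million, which is the value the listed probabilities imply; the function name is illustrative.

```python
# Counts in billions for translations of "Gebäude".
# 0.928 is an assumption: the value implied by the probability column.
counts = {"house": 5.28, "building": 4.16, "tower": 0.928}

def lexical_probabilities(counts):
    """Relative-frequency estimate: p(e | Gebäude) = count(e) / total."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = lexical_probabilities(counts)
for word, p in probs.items():
    print(f"{word}: {p:.2f}")
```
</script>
</section>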
<section data-background-image="img/ZsMwlm2-yellowish.jpg">
<h3>Alignment</h3>
<ul>
<li>In a parallel text (or when we translate), we align the words in one language with the words in the other<br/>
<table>
<tr>
<td>Das</td><td>Gebäude</td><td>ist</td><td>hoch</td>
</tr>
<tr>
<td style='text-align: center;'>↓</td><td style='text-align: center;'>↓</td><td style='text-align: center;'>↓</td><td style='text-align: center;'>↓</td>
</tr>
<tr>
<td>the</td><td>building</td><td>is</td><td>high</td>
</tr>
</table>
</li>
<li>Word positions are numbered 1 to 4</li>
</ul>
</section>
<section data-background-image="img/ZsMwlm2-yellowish.jpg">
<h3>Alignment Function</h3>
<ul>
<li>Define the alignment with an alignment function</li>
<li>Map an English target word at position <code>i</code> to a German source word at position <code>j</code> with a function <code>a : i → j</code></li>
<li>Example</li>
</ul>
<pre><code data-trim data-noescape>a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}</code></pre>
</section>
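<section data-markdown data-background-image="img/ZsMwlm2-yellowish.jpg">
<script type="text/template">
The alignment function can be sketched as a plain mapping from target positions to source positions. The helper name `aligned_pairs` is illustrative only.

```python
german = ["Das", "Gebäude", "ist", "hoch"]
english = ["the", "building", "is", "high"]

# a : i -> j, using 1-based word positions.
alignment = {1: 1, 2: 2, 3: 3, 4: 4}

def aligned_pairs(target, source, a):
    """List (target word, source word) pairs given the alignment a."""
    return [(target[i - 1], source[a[i] - 1]) for i in sorted(a)]

pairs = aligned_pairs(english, german, alignment)
print(pairs)
```
</script>
</section>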
<section data-background-image="img/ZsMwlm2-blueish.jpg">
<h3>One-to-Many Translation</h3>
<p>A source word could translate into multiple target words</p>
<table>
<tr>
<td>Das</td><td>ist</td><td>ein</td><td colspan='3' style='text-align: center;'>Hochhaus </td>
</tr>
<tr>
<td style='text-align: center;'>↓</td><td style='text-align: center;'>↓</td><td style='text-align: center;'>↓</td><td style='text-align: center;'>↙</td><td style='text-align: center;'>↓</td><td style='text-align: center;'>↘</td>
</tr>
<tr>
<td>This</td><td>is</td><td>a</td><td>high </td><td>rise</td><td>building</td>
</tr>
</table>
</section>
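<section data-markdown data-background-image="img/ZsMwlm2-blueish.jpg">
<script type="text/template">
With a target-to-source alignment, one-to-many translation simply means several target positions map to the same source position. A small sketch, with an illustrative helper name:

```python
from collections import defaultdict

german = ["Das", "ist", "ein", "Hochhaus"]
english = ["This", "is", "a", "high", "rise", "building"]

# Target position i -> source position j (1-based). Three English words
# point at the single German word "Hochhaus" (position 4).
alignment = {1: 1, 2: 2, 3: 3, 4: 4, 5: 4, 6: 4}

def targets_per_source(target, source, a):
    """Group target words under the source word they are aligned to."""
    grouped = defaultdict(list)
    for i in sorted(a):
        grouped[source[a[i] - 1]].append(target[i - 1])
    return dict(grouped)

grouped = targets_per_source(english, german, alignment)
print(grouped["Hochhaus"])
```
</script>
</section>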
<section data-background-image="img/ZsMwlm2-redish.jpg">
<h3>Phrase-based Translation</h3>
</section>
<section data-background-image="img/ZsMwlm2-redish.jpg">
<h3>Word-Based vs. Phrase-Based Models</h3>
<ul>
<li>Word-Based Models translate words as atomic units</li>
<li>Phrase-Based Models translate phrases as atomic units</li>
<li>Advantages:</li>
<ul>
<li>many-to-many translation can handle non-compositional phrases</li>
<li>use of local context in translation</li>
<li>the more data, the longer phrases can be learned</li>
</ul>
<li>“Standard Model”, used by Google Translate and others</li>
</ul>
</section>
<section data-background-image="img/ZsMwlm2-redish.jpg">
<h3>Phrase-Based Model</h3>
<table style='font-size: 80%;'>
<tr>
<td>Berlin</td><td>ist</td><td>ein</td><td>herausragendes</td><td>Kunst- und Kulturzentrum</td><td>.</td>
</tr>
<tr>
<td style='text-align: center;'>↓</td><td style='text-align: center;'>↓</td><td style='text-align: center;'>↓</td><td style='text-align: center;'>↓</td><td style='text-align: center;'>↓</td><td style='text-align: center;'>↓</td>
</tr>
<tr>
<td>Berlin</td><td>is</td><td>an</td><td>outstanding</td><td>art and cultural center</td><td>.</td>
</tr>
</table>
<br/>
<ul>
<li>Foreign input is segmented into phrases</li>
<li>Each phrase is translated into English</li>
<li>Phrases are reordered</li>
</ul>
</section>
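<section data-markdown data-background-image="img/ZsMwlm2-redish.jpg">
<script type="text/template">
A simplified sketch of the first two steps (segment, translate) using a hypothetical phrase table. Greedy longest-match segmentation in monotone order is an assumption made for brevity; a real phrase-based decoder scores many segmentations and reorderings.

```python
# Hypothetical phrase table; the entries mirror the example sentence.
phrase_table = {
    ("Berlin",): "Berlin",
    ("ist",): "is",
    ("ein",): "an",
    ("herausragendes",): "outstanding",
    ("Kunst-", "und", "Kulturzentrum"): "art and cultural center",
    (".",): ".",
}

def translate(tokens, table, max_len=3):
    """Greedy longest-match segmentation, monotone order (no reordering)."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in table:
                out.append(table[phrase])
                i += n
                break
        else:
            out.append(tokens[i])  # pass unknown words through unchanged
            i += 1
    return " ".join(out)

source = "Berlin ist ein herausragendes Kunst- und Kulturzentrum .".split()
result = translate(source, phrase_table)
print(result)
```
</script>
</section>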
<section data-background-image="img/ZsMwlm2-purpleish.jpg">
<h3>Decoding</h3>
</section>
<section data-background-image="img/ZsMwlm2-purpleish.jpg">
<ul>
<li>We have a mathematical model for translation <code>p(e|f)</code></li>
<li>Task of decoding: find the translation <code>e<sub style="font-size:50%">best</sub></code> with the highest probability<br/>
<pre><code data-trim data-noescape>e<sub style="font-size:50%">best</sub> = <i>argmax</i> p(e|f)</code></pre>
</li>
<li>Two types of error</li>
<ul>
<li>the most probable translation is bad → fix the model</li>
<li>search does not find the most probable translation → fix the search</li>
</ul>
</ul>
</section>
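<section data-markdown data-background-image="img/ZsMwlm2-purpleish.jpg">
<script type="text/template">
A toy illustration of the second error type, a search error. The two-step model and all its probabilities are invented: exhaustive search finds the model's most probable output, while a greedy search commits early and misses it.

```python
# Hypothetical model: p(first word) and p(second word | first word).
p_first = {"he": 0.6, "it": 0.4}
p_second = {
    "he": {"drinks": 0.55, "drink": 0.45},
    "it": {"drinks": 0.9, "drink": 0.1},
}

def exhaustive_search():
    """Score every two-word output; always finds the model's argmax."""
    scored = {
        (w1, w2): p_first[w1] * p2
        for w1 in p_first
        for w2, p2 in p_second[w1].items()
    }
    return max(scored.items(), key=lambda kv: kv[1])

def greedy_search():
    """Commit to the best word at each step; cheap, but can miss the argmax."""
    w1 = max(p_first, key=p_first.get)
    w2 = max(p_second[w1], key=p_second[w1].get)
    return (w1, w2), p_first[w1] * p_second[w1][w2]

best, best_score = exhaustive_search()
found, found_score = greedy_search()
search_error = found != best
print(best, found, search_error)
```

If the exhaustive result itself were a bad translation, that would be a model error instead, and no amount of better search would fix it.
</script>
</section>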
<section data-background-image="img/ZsMwlm2-greenish.jpg" data-transition="none">
<h3>Translation Process</h3>
<p>Translate this query from German into English</p>
<table>
<tr>
<td>er</td><td>trinkt</td><td>ja</td><td>noch</td><td>nichts</td>
</tr>
</table>
<table id='special-table'>
<col width="30%"/><col width="1%"/><col width="30%"/><col width="1%"/><col width="30%"/>
<tr>
<td style="text-align: center;">er</td><td class='borderless'> </td><td class='borderless' style=""> </td><td class='borderless'> </td><td class='borderless'> </td>
</tr>
<tr>
<td class='borderless' style="text-align: center;">↓</td><td class='borderless'> </td><td class='borderless'> </td><td class='borderless'> </td><td class='borderless'> </td>
</tr>
<tr>
<td style="text-align: center;">he</td><td class='borderless'> </td><td class='borderless'> </td><td class='borderless'> </td><td class='borderless'> </td>
</tr>
</table>
<p>Pick an input phrase, translate it</p>
</section>
<section data-background-image="img/ZsMwlm2-greenish.jpg" data-transition="none">
<h3>Translation Process</h3>
<p>Translate this query from German into English</p>
<table>
<tr>
<td>er</td><td>trinkt</td><td>ja</td><td>noch</td><td>nichts</td>
</tr>
</table>
<table id='special-table'>
<col width="30%"/><col width="1%"/><col width="30%"/><col width="1%"/><col width="30%"/>
<tr>
<td style="text-align: center;">er</td><td class='borderless'> </td><td class='borderless'></td><td class='borderless'> </td><td style="text-align: center;">ja noch nichts</td>
</tr>
<tr>
<td class='borderless' style="text-align: center;">↓</td><td class='borderless'> </td><td class='borderless'> </td><td class='borderless'><img src="img/diagonal-left.png" style="border: none; box-shadow: none; background: none; width: 0.6em; margin: 0px; max-width: none; max-height: none;" /></td><td class='borderless'> </td>
</tr>
<tr>
<td style="text-align: center;">he</td><td class='borderless'> </td><td style='text-align: center;'>does not yet</td><td class='borderless'> </td><td class='borderless'></td>
</tr>
</table>
<p>Pick an input phrase, translate it</p>
</section>
<section data-background-image="img/ZsMwlm2-greenish.jpg" data-transition="none">
<h3>Translation Process</h3>
<p>Translate this query from German into English</p>
<table>
<tr>
<td>er</td><td>trinkt</td><td>ja</td><td>noch</td><td>nichts</td>
</tr>
</table>
<table id='special-table'>
<col width="30%"/><col width="1%"/><col width="30%"/><col width="1%"/><col width="30%"/>
<tr>
<td style="text-align: center;">er</td><td class='borderless'> </td><td>trinkt</td><td class='borderless'> </td><td style="text-align: center;">ja noch nichts</td>
</tr>
<tr>
<td class='borderless' style="text-align: center;">↓</td><td class='borderless'> </td><td class='borderless'> </td><td class='borderless'><img src="img/crossing.png" style="border: none; box-shadow: none; background: none; width: 0.7em; margin: 0px; max-width: none; max-height: none;" /></td><td class='borderless'> </td>
</tr>
<tr>
<td style="text-align: center;">he</td><td class='borderless'> </td><td style='text-align: center;'>does not yet</td><td class='borderless'> </td><td>drink</td>
</tr>
</table>
<p>Pick an input phrase, translate it</p>
</section>
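<section data-markdown data-background-image="img/ZsMwlm2-greenish.jpg">
<script type="text/template">
The three steps above can be sketched as a loop that picks an uncovered source span, appends its English phrase, and marks the span as covered. The phrase pairs and their order are the ones from the slides; the coverage bookkeeping and the `decode` name are illustrative assumptions, and no scoring is shown.

```python
source = ["er", "trinkt", "ja", "noch", "nichts"]

# (start, end) source span, end exclusive -> English phrase, in pick order.
steps = [
    ((0, 1), "he"),
    ((2, 5), "does not yet"),
    ((1, 2), "drink"),
]

def decode(source, steps):
    """Build the output left to right while tracking covered source words."""
    covered = [False] * len(source)
    output = []
    for (start, end), english in steps:
        if any(covered[start:end]):
            raise ValueError("a source word may be translated only once")
        for k in range(start, end):
            covered[k] = True
        output.append(english)
    if not all(covered):
        raise ValueError("every source word must be covered")
    return " ".join(output)

translation = decode(source, steps)
print(translation)
```
</script>
</section>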
<section data-background-image="img/ZsMwlm2-yellowish.jpg">
<h3>Apache Joshua</h3>
<img src='img/joshua-logo.png' style='border: none; background: none; box-shadow: none;' width='30%' />
</section>
<section data-background-image="img/ZsMwlm2-yellowish.jpg">
<ul style='font-size: 80%;'>
<li>Statistical Machine Translation Decoder for phrase-based and hierarchical machine translation</li>
<li>Written in Java</li>
<li>Provides 64 language packs for machine translation</li>
<ul>
<li>https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs</li>
</ul>
<li>Project initiated by Johns Hopkins University and the University of Pennsylvania</li>
<li>Presently incubating at the Apache Software Foundation</li>
<li>Used extensively by Amazon.com and NASA JPL</li>
<li>https://cwiki.apache.org/confluence/display/JOSHUA</li>
<li><i class="fa fa-twitter" aria-hidden="true"></i> @ApacheJoshua</li>
</ul>
</section>
<section data-background-color="#9E9E9E">
<h3>Flows</h3>
<img src='img/flow1.png' style='max-width: 50%; border: none; box-shadow: none; background: none; float: left;' />
<img src='img/flow2.png' style='max-width: 50%; border: none; box-shadow: none; background: none;' />
</section>
<section data-background-image="img/ZsMwlm2-yellowish.jpg">
<h3>References</h3>
<ul style='font-size: 70%'>
<li>Apache Joshua — https://cwiki.apache.org/confluence/display/JOSHUA</li>
<li>Apache OpenNLP — https://opennlp.apache.org</li>
<li>GitHub — https://github.com/smarthi/BBuzz-multilang-search</li>
<li>Slides — https://smarthi.github.io/bbuzz17-embracing-diversity-searching-over-multiple-languages/#/</li>
</ul>
</section>
<section data-background-image="img/buzzwords_2017-blueish.png">
<h3>Credits</h3>
<br/>
<ul>
<li>Joern Kottmann — PMC Chair, Apache OpenNLP</li>
<li>Matt Post — PMC Chair, Apache Joshua</li>
<li>Bruno P. Kinoshita — Committer on Apache OpenNLP, committer and PMC on Apache Commons and Apache Jena</li>
</ul>
</section>
<section data-background-image="img/ZsMwlm2.jpg">
<h2>Questions?</h2>
</section>
</div>
</div>
<script src="lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>
<script>
// More info about config & dependencies:
// - https://github.com/hakimel/reveal.js#configuration
// - https://github.com/hakimel/reveal.js#dependencies
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'slide', // none/fade/slide/convex/concave/zoom
keyboard: true,
touch: true,
loop: false,
fragments: true,
dependencies: [
{ src: 'plugin/markdown/marked.js' },
{ src: 'plugin/markdown/markdown.js' },
{ src: 'plugin/notes/notes.js', async: true },
{ src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } }
]
});
</script>
</body>
</html>