<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Beyond n-grams, tf-idf, and word indicators for text</title>
<meta name="description" content="2021 Stata Conference presentation on using vector embeddings in Stata">
<meta name="author" content="Billy Buchanan">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="dist/reset.css">
<link rel="stylesheet" href="dist/reveal.css">
<link rel="stylesheet" href="dist/theme/black.css" id="theme">
<!-- Theme used for syntax highlighting of code
This theme seems to provide better color contrast with the dark background
-->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.1.0/styles/xt256.min.css">
<style>
a {
color: white !important;
}
</style>
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="slides">
<section>
<section>
<h5>If you want to follow along:</h5>
<p>There are scripts and instructions available here:</p><br>
<p><a href="https://github.com/wbuchanan/stataConference2021" target="_blank">https://github.com/wbuchanan/stataConference2021</a></p><br>
<p>Some of the installation can take a bit of time, so you may want to start downloading/installing now.</p>
</section>
<section>
<h2>Beyond n-grams, tf-idf, and word indicators for text:</h2>
<h3>Leveraging the Python API for vector embeddings</h3><br>
<a href="https://github.com/wbuchanan" target="_blank">Billy Buchanan</a><br>
<span>Senior Research Scientist</span><br>
<span><a href="https://sagcorp.com/" target="_blank">SAG Corporation</a></span>
<aside class="notes">
<p>I'm going to move a bit faster when introducing the concepts but will try to slow things down a
bit once I get to the code snippets in case anyone is interested in following along. If you have
any questions feel free to put them into the chat/Q&A feature.</p>
<p>This talk will share strategies that Stata users can use to get more
informative word, sentence, and document vector embeddings of text
in their data. While indicator and bag-of-words strategies can be
useful for some types of text analytics, they lack the richness of
the semantic relationships between words that provide meaning and
structure to language. Vector space embeddings attempt to preserve
these relationships and in doing so can provide more robust numerical
representations of text data that can be used for subsequent analysis.
I will share strategies for using existing tools from the Python
ecosystem with Stata to leverage the advances in NLP in your Stata
workflow.</p>
</aside>
</section>
</section>
<section>
<section>
<h3>Motivation</h3>
<ul>
<li>Bag of Words (BoW) models are not always capable of modeling the meaning in natural language.</li>
<li>BoW, TF-IDF, and N-grams typically result in highly sparse matrices with large dimensions.</li>
<li>Because word order can affect semantics, these methods can introduce substantial error into your models.</li>
</ul>
<aside class="notes">
<ul>
<li>NLP ultimately is about the meaning and/or ideas communicated using language.</li>
<li>I'll show some examples that illustrate how indicators, bags of words, and TF-IDF would return the same vectors despite the meaning of the sentences being different.</li>
<li>Sometimes the subject or object being modified can be more or less distant to its modifiers. I'll share some examples of this as well.</li>
<li>For a classic example "The dog bit the cat" and "The cat bit the dog" very clearly mean two different things, but with the simpler methods both would have the same vector representing the sentences.</li>
</ul>
</aside>
</section>
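The identical-vector problem described in the notes can be sketched in a few lines of plain Python. This is a minimal illustration (using a hand-rolled counter rather than any particular NLP library): the two sentences mean opposite things, yet their bag-of-words vectors are indistinguishable.

```python
from collections import Counter

def bow_vector(sentence, vocab):
    # Count occurrences of each vocabulary token in the lowercased sentence
    counts = Counter(sentence.lower().strip(".").split())
    return [counts[t] for t in vocab]

s1 = "The dog bit the cat."
s2 = "The cat bit the dog."
# Shared vocabulary in a fixed (sorted) order
vocab = sorted(set(s1.lower().strip(".").split()))

v1 = bow_vector(s1, vocab)
v2 = bow_vector(s2, vocab)
# Despite opposite meanings, the vectors are identical
print(v1 == v2)  # True
```

Any model trained on these vectors literally cannot distinguish the biter from the bitten.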
<section>
<table style="width: fit-content !important; font-size: 0.95rem !important;">
<caption>Bag of Words Example of Meaning Varying by Word Order<sup><a href="https://www.city-data.com/forum/writing/1115620-two-sentences-have-same-words-but-2.html#post16932155" target="_blank">1</a></sup></caption>
<colgroup>
<col span="1" style="width: 5% !important;"><col span="1" style="width: 55% !important;">
<col span="1" style="width: 5% !important;"><col span="1" style="width: 5% !important;">
<col span="1" style="width: 5% !important;"><col span="1" style="width: 5% !important;">
<col span="1" style="width: 5% !important;"><col span="1" style="width: 5% !important;">
<col span="1" style="width: 5% !important;"><col span="1" style="width: 5% !important;">
</colgroup>
<tr>
<td></td><td></td><td colspan="8" style="text-align: center;">Bag of Words Vector</td>
</tr>
<tr>
<th>ID</th><th>Sentence</th><th>he</th><th>his</th><th>her</th>
<th>loved</th><th>only</th><th>that</th><th>told</th><th>wife</th>
</tr>
<tr>
<td>1</td><td>Only he told his wife that he loved her.</td>
<td>2</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
</tr>
<tr>
<td>2</td><td>He only told his wife that he loved her.</td>
<td>2</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
</tr>
<tr>
<td>3</td><td>He told only his wife that he loved her.</td>
<td>2</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
</tr>
<tr>
<td>4</td><td>He told his only wife that he loved her.</td>
<td>2</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
</tr>
<tr>
<td>5</td><td>He told his wife only that he loved her.</td>
<td>2</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
</tr>
<tr>
<td>6</td><td>He told his wife that only he loved her.</td>
<td>2</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
</tr>
<tr>
<td>7</td><td>He told his wife that he only loved her.</td>
<td>2</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
</tr>
<tr>
<td>8</td><td>He told his wife that he loved only her.</td>
<td>2</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
</tr>
<tr>
<td>9</td><td>He told his wife that he loved her only.</td>
<td>2</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
</tr>
</table><br>
<small>Do these sentences all mean the same thing?</small>
<small>How would a model built on the bag of words vectors distinguish between the meanings?</small>
<aside class="notes">
<ul>
<li>The examples here are modified from a forum on the website city-data.com. Click on the superscript to see the original examples.</li>
<li>If these were notes taken by psychologists observing couples trying to model the likelihood of divorce, would each of these sentences indicate the same relationship?</li>
<li>N-grams could be a little useful, but to capture the difference between each of the different sentences would require the use of multiple n-grams (which isn't horrible, but can be more computationally expensive).</li>
</ul>
</aside>
</section>
<section>
<table style="width: fit-content !important; margin-left: -7.5% !important; font-size: 0.95rem !important;">
<caption>N-Gram Example of Meaning Varying by Word Order<sup><a href="https://www.city-data.com/forum/writing/1115620-two-sentences-have-same-words-but-2.html#post16932155" target="_blank">1</a></sup></caption>
<colgroup>
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;"><col span="1" style="width: 4% !important;">
</colgroup>
<tr>
<td></td>
<td colspan="22" style="text-align: center;">N-Gram Vector</td>
</tr>
<tr>
<th>Sentence ID</th>
<th>only he</th><th>he told</th><th>told his</th><th>his wife</th><th>wife that</th><th>that he</th>
<th>he loved</th><th>loved her</th><th>he only</th><th>only told</th><th>told only</th><th>only his</th>
<th>his only</th><th>only wife</th><th>wife only</th><th>only that</th><th>that only</th><th>only he</th>
<th>only loved</th><th>loved only</th><th>only her</th><th>her only</th>
</tr>
<tr>
<td>1</td>
<td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
<td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td>
</tr>
<tr>
<td>2</td>
<td>0</td><td>0</td><td>1</td><td>1</td><td>1</td><td>1</td>
<td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td>
</tr>
<tr>
<td>3</td>
<td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td>
<td>1</td><td>1</td><td>0</td><td>0</td><td>1</td><td>1</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td>
</tr>
<tr>
<td>4</td>
<td>0</td><td>1</td><td>1</td><td>0</td><td>1</td><td>1</td>
<td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td>
</tr>
<tr>
<td>5</td>
<td>0</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td>
<td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td>
</tr>
<tr>
<td>6</td>
<td>0</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td>
<td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td>
<td>0</td><td>0</td><td>0</td><td>0</td>
</tr>
<tr>
<td>7</td>
<td>0</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
<td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>1</td><td>0</td><td>0</td><td>0</td>
</tr>
<tr>
<td>8</td>
<td>0</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
<td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>1</td><td>1</td><td>0</td>
</tr>
<tr>
<td>9</td>
<td>0</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td>
<td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>0</td><td>0</td><td>0</td><td>1</td>
</tr>
</table><br>
<small>To accurately model the meaning of the sentences, how much sparser would the matrix need to get?</small>
<small>How many different n-grams would need to be used to capture that information?</small>
<aside class="notes">
<ul>
<li>With bi-grams we can see how the matrix begins to become sparser.</li>
<li>Although dimensionality on its own may not be a problem, it becomes a bigger issue in the context of sparse matrices.</li>
<li>Bi-grams capture a little additional information, but they still don't capture everything.</li>
</ul>
</aside>
</section>
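A quick sketch of how the bi-gram rows in the table above are produced (again with plain Python rather than a specific NLP package): the bi-gram profiles of two of the "only" sentences now differ, but only at the cost of many more, mostly zero, columns.

```python
from collections import Counter

def bigrams(sentence):
    # Build a count of adjacent token pairs (bi-grams)
    tokens = sentence.lower().strip(".").split()
    return Counter(zip(tokens, tokens[1:]))

a = bigrams("Only he told his wife that he loved her.")
b = bigrams("He told his wife that he loved her only.")

# Unlike the unigram bag of words, the bi-gram profiles now differ
print(a != b)  # True
# ...but most bi-grams are still shared, and each new placement of "only"
# adds new columns to the matrix, increasing sparsity
shared = set(a) & set(b)
print(len(shared))
```

Each distinct position of "only" introduces bi-grams that appear in exactly one sentence, which is precisely the sparsity growth the slide asks about.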
<section>
<h3>How do the simpler methods work when trying to measure some form of sentiment?</h3>
</section>
<section>
<table style="width: fit-content !important; font-size: 0.95rem !important;">
<caption>BoW Example for Sentiment</caption>
<colgroup>
<col span="1" style="width: 3% !important;"><col span="1" style="width: 60% !important;">
<col span="1" style="width: 3% !important;"><col span="1" style="width: 3% !important;">
<col span="1" style="width: 3% !important;"><col span="1" style="width: 3% !important;">
<col span="1" style="width: 3% !important;"><col span="1" style="width: 3% !important;">
<col span="1" style="width: 3% !important;"><col span="1" style="width: 3% !important;">
<col span="1" style="width: 3% !important;"><col span="1" style="width: 3% !important;">
<col span="1" style="width: 3% !important;"><col span="1" style="width: 3% !important;">
<col span="1" style="width: 3% !important;"><col span="1" style="width: 3% !important;">
<col span="1" style="width: 3% !important;"><col span="1" style="width: 3% !important;">
<col span="1" style="width: 3% !important;">
</colgroup>
<tr>
<td></td><td></td>
<td colspan="17" style="text-align: center;">BoW Vector</td>
</tr>
<tr>
<th>Sentence ID</th><th>Sentence</th>
<th>I</th><th>apples</th><th>are</th><th>as</th><th>bad</th>
<th>be</th><th>did</th><th>expect</th><th>expected</th><th>half</th>
<th>not</th><th>of</th><th>the</th><th>this</th><th>to</th>
<th>were</th><th>would</th>
</tr>
<tr>
<td>1</td><td>I did not expect the apples to be this bad.</td>
<td>1</td><td>1</td><td>0</td><td>0</td><td>1</td>
<td>1</td><td>1</td><td>1</td><td>0</td><td>0</td>
<td>1</td><td>0</td><td>1</td><td>1</td><td>1</td>
<td>0</td><td>0</td>
</tr>
<tr>
<td>2</td><td>This half of the apples are bad.</td>
<td>0</td><td>1</td><td>1</td><td>0</td><td>1</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>1</td>
<td>0</td><td>1</td><td>1</td><td>1</td><td>0</td>
<td>0</td><td>0</td>
</tr>
<tr>
<td>3</td><td>The apples were not half bad.</td>
<td>0</td><td>1</td><td>0</td><td>0</td><td>1</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>1</td>
<td>1</td><td>0</td><td>1</td><td>0</td><td>0</td>
<td>1</td><td>0</td>
</tr>
<tr>
<td>4</td><td>Half of the apples were not bad.</td>
<td>0</td><td>1</td><td>0</td><td>0</td><td>1</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>1</td>
<td>1</td><td>1</td><td>1</td><td>0</td><td>0</td>
<td>1</td><td>0</td>
</tr>
<tr>
<td>5</td><td>The apples were not half as bad as I expected.</td>
<td>1</td><td>1</td><td>0</td><td>2</td><td>1</td>
<td>0</td><td>0</td><td>0</td><td>1</td><td>1</td>
<td>1</td><td>0</td><td>1</td><td>0</td><td>0</td>
<td>1</td><td>0</td>
</tr>
<tr>
<td>6</td><td>I expected the apples would not be half bad.</td>
<td>1</td><td>1</td><td>0</td><td>0</td><td>1</td>
<td>1</td><td>0</td><td>0</td><td>1</td><td>1</td>
<td>1</td><td>0</td><td>1</td><td>0</td><td>0</td>
<td>0</td><td>1</td>
</tr>
<tr>
<td>7</td><td>The apples were not bad.</td>
<td>0</td><td>1</td><td>0</td><td>0</td><td>1</td>
<td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
<td>1</td><td>0</td><td>1</td><td>0</td><td>0</td>
<td>1</td><td>0</td>
</tr>
</table><br>
</section>
<section>
<table style="font-size: 0.95rem !important;">
<caption>Cosine Distances Between Sentiment Examples</caption>
<colgroup>
<col span="1" style="width: 60% !important;">
<col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;">
<col span="1" style="width: 4% !important;">
</colgroup>
<tr><th></th><th colspan="7" style="text-align: center;">Distance to Other Sentence</th></tr>
<tr><th>Source Sentence</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th></tr>
<tr><td>I did not expect the apples to be this bad.</td><td>0</td><td>0.48</td><td>0.52</td><td>0.48</td><td>0.46</td><td>0.63</td><td>0.57</td></tr>
<tr><td>This half of the apples are bad.</td><td>0.48</td><td>0</td><td>0.62</td><td>0.71</td><td>0.44</td><td>0.50</td><td>0.51</td></tr>
<tr><td>The apples were not half bad.</td><td>0.52</td><td>0.62</td><td>0</td><td>0.93</td><td>0.71</td><td>0.68</td><td>0.91</td></tr>
<tr><td>Half of the apples were not bad.</td><td>0.48</td><td>0.71</td><td>0.93</td><td>0</td><td>0.65</td><td>0.63</td><td>0.85</td></tr>
<tr><td>The apples were not half as bad as I expected.</td><td>0.46</td><td>0.44</td><td>0.71</td><td>0.65</td><td>0</td><td>0.67</td><td>0.65</td></tr>
<tr><td>I expected the apples would not be half bad.</td><td>0.63</td><td>0.50</td><td>0.68</td><td>0.63</td><td>0.67</td><td>0</td><td>0.60</td></tr>
<tr><td>The apples were not bad.</td><td>0.57</td><td>0.51</td><td>0.91</td><td>0.85</td><td>0.65</td><td>0.60</td><td>0</td></tr>
</table><br>
<small>Do these distances accurately reflect how similar you would judge the sentiment contained in the sentences?</small>
<aside class="notes">
<ul>
<li>If you compare the first and last sentences, you'll see that they are closer than the first and sixth sentence.</li>
<li>This is counterintuitive, to say the least, because the last sentence indicates all apples were not in a negative state, while the sixth sentence contains a more ambiguous sentiment.</li>
<li>Similarly, sentence 2 (This half of the apples are bad) is closer to sentence 5 (The apples were not half as bad as I expected) than it is to sentence 4 (Half of the apples were not bad).</li>
<li>This is problematic since sentence 2 conveys negative sentiment and sentence 5 conveys a positive sentiment while sentence 4 conveys a more ambiguous sentiment.</li>
</ul>
</aside>
</section>
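For readers who want to reproduce distances like those in the table, cosine distance between two bag-of-words count vectors is straightforward to compute. The two vectors below are toy counts invented for illustration, not the actual vectors from the preceding slides.

```python
import math

def cosine_distance(u, v):
    # 1 minus the cosine similarity between two count vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical BoW count vectors for two short sentences
v_a = [1, 0, 2, 1]
v_b = [1, 1, 2, 0]
d = cosine_distance(v_a, v_b)
print(round(d, 2))
```

A distance of 0 means the count profiles are identical up to scale; a distance of 1 means the sentences share no tokens, which is why reordered sentences with the same words always score 0 under this metric.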
<section data-autoslide="4000">
<h3>How do vector embeddings solve these issues?</h3>
</section>
<section>
<ul>
<li>They reduce the sparsity of the data matrix.</li>
<li>They use a fixed number of dimensions to represent word meaning and context simultaneously.</li>
<li>They can be aggregated to generate embeddings for higher-level units of language (sentences, documents).</li>
<li>They can draw on character- and sub-word-level information, which is often informative.</li>
</ul>
<aside class="notes">
<ul>
<li>Vector embeddings are fixed-dimensional representations of words, unlike bag-of-words/TF-IDF representations, which grow as a function of the number of unique tokens in the corpus.</li>
<li>Many deep neural network-based methods will return hundreds, thousands, or more dimensions in their embeddings.</li>
<li>While the simpler methods could also be aggregated, it isn't clear if it would be beneficial.</li>
<li>For example, using fastText and other similar models, the n-gram embeddings at the character level are aggregated with character vectors to form the word vector.</li>
<li>This particular feature is important when many words share similar meanings via construction and morphological features.</li>
</ul>
</aside>
</section>
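The aggregation point above is often implemented as simple averaging: a sentence embedding is the mean of its word vectors. The tiny 3-dimensional "embeddings" below are made-up values purely for illustration (real models use hundreds of dimensions).

```python
# Toy 3-d word "embeddings"; the values are invented for illustration only
emb = {
    "apples": [0.9, 0.1, 0.3],
    "were":   [0.2, 0.5, 0.1],
    "not":    [0.1, 0.8, 0.6],
    "bad":    [0.7, 0.2, 0.9],
}

def sentence_vector(sentence, emb, dim=3):
    # Average the word vectors of known tokens into one fixed-length vector
    vecs = [emb[t] for t in sentence.lower().split() if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

sv = sentence_vector("apples were not bad", emb)
print(sv)
```

Note the dimensionality stays fixed at 3 no matter how long the sentence is, which is exactly the contrast with BoW/TF-IDF drawn above; more sophisticated schemes (TF-IDF weighting, contextual models) refine but do not change this basic idea.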
<section data-autoslide="4000">
<h3>Vector embeddings are not the panacea to your NLP related problems</h3>
</section>
<section>
<h4>Limitations/Disadvantages</h4>
<ul>
<li>Interpretability</li>
<li>Reproducibility<sup>*</sup></li>
<li>Domain Specificity/Generalizability</li>
<li>Computational Time<sup>*</sup></li>
</ul>
<aside class="notes">
<ul>
<li>Unlike indicators for individual tokens, there isn't an easy way to interpret the dimensions of a word embedding.</li>
<li>While interpretability of the individual dimensions isn't an issue in the context of predictive modeling, embeddings may not be useful if the interest is in estimating parameters related to individual words.</li>
<li>Depending on the model and package being used, it may not be possible/easy to reproduce the embeddings exactly.</li>
<li>This is due to the use of a randomized starting vector and/or tuning the model to your data via tuning/training.</li>
<li>Any modern language model will necessarily have some degree of domain specificity inherent to it. This means that while one pre-trained model may be excellent for one task, it may perform poorly or unpredictably on data from a different domain.</li>
<li>However, there are new language models being released and shared all the time which may be close enough to your use case to be useful.</li>
<li>It can definitely take longer at times to get word embeddings and push them back into Stata compared with creating Bag of Words representations.</li>
<li>If you are tuning a pre-trained model to your data, the computational overhead can definitely increase significantly.</li>
<li>In that case, I would strongly recommend using a system that has one or more GPUs available so you can get the benefit of the GPUs while tuning/training the model generating the embeddings.</li>
</ul>
</aside>
</section>
</section>
<section>
<section data-autoslide="4500">
<h3>Getting Started</h3>
</section>
<section>
<table style="font-size: 2rem !important; width: fit-content !important;">
<caption>Python Packages for Vector Embeddings</caption>
<tr>
<th>Package Name</th><th>CUDA</th><th>pip</th><th>conda</th>
</tr>
<tr>
<td><a href="https://spacy.io/" target="_blank">spaCy</a><sup>*†</sup></td><td>Y</td><td>Y</td><td>Y</td>
</tr>
<tr>
<td><a href="https://huggingface.co/transformers/" target="_blank">transformers</a><sup>*†</sup></td><td>Y</td><td>Y</td><td>Y</td>
</tr>
<tr>
<td><a href="https://radimrehurek.com/gensim/#" target="_blank">gensim</a><sup>*</sup></td><td>N</td><td>Y</td><td>Y</td>
</tr>
<tr>
<td><a href="https://nlp.stanford.edu/projects/glove/" target="_blank">GloVe</a></td><td>N</td><td>Y</td><td>Y</td>
</tr>
<tr>
<td><a href="https://fasttext.cc/" target="_blank">fastText</a></td><td>N</td><td>Y</td><td>Y</td>
</tr>
<tr>
<td><a href="https://textblob.readthedocs.io/en/dev/" target="_blank">TextBlob</a></td><td>N</td><td>Y</td><td>Y</td>
</tr>
<tr>
<td><a href="https://www.nltk.org/" target="_blank">NLTK<sup>‡</sup></a></td><td>N</td><td>Y</td><td>Y</td>
</tr>
<tr>
<td><a href="" target="_blank">simplerepresentations</a></td><td>N/A</td><td>Y</td><td>N</td>
</tr>
</table>
<div style="width: 125% !important;">
<small style="font-size: 1.15rem !important;"><sup>*</sup> These packages provide access to several pre-trained models used to generate vector embeddings.</small>
<small style="font-size: 1.15rem !important;"><sup>†</sup> These packages will be used for subsequent examples.</small>
<small style="font-size: 1.15rem !important;"><sup>‡</sup> While the Natural Language ToolKit (NLTK) doesn't provide word embeddings, it has a lot of other useful tools for working with text.</small>
</div>
<aside class="notes">
<ul>
<li>Here is a list of packages that you should be able to install using pip or conda.</li>
<li>I'm only going to use a few of these packages for the examples, but know there are many packages and models to do this work in the Python ecosystem.</li>
<li>For the sake of flexibility, simplicity, and speed, we'll focus on just a couple of examples using transformers and spaCy.</li>
<li>simplerepresentations is a wrapper module, so it relies on transformers under the hood.</li>
</ul>
</aside>
</section>
<section>
<h5>Installing spaCy</h5>
<pre data-id="code-animation" style="width: fit-content !important; font-size: 0.95rem !important;"><code class="hljs" data-trim data-line-numbers="1-6,12-15|1-2,7-8,12-15|1,2,9-15|12,13|14,15"><script type="text/template">
# Installing spaCy using pip
$ pip install -U pip setuptools wheel
# Use this line if you have no intention to train models
$ pip install -U spacy
# Or to install using conda:
$ conda install -c conda-forge spacy
# Use this line instead if you want to be able to train models
$ pip install -U spacy[transformers,lookups]
# If you want to add CUDA support, add it as an option like this:
# where the ### following cuda is the version number (e.g., 102 = CUDA 10.2)
$ pip install -U spacy[cuda111,transformers,lookups]
# This is necessary before using spaCy and downloads the pretrained model
$ python -m spacy download en_core_web_sm
# For the accuracy optimized pre-trained model use this line instead
$ python -m spacy download en_core_web_lg
</script></code></pre>
<aside class="notes">
<ul>
<li>I opted to go the route of installing for accuracy instead of speed.</li>
<li>There is a fairly large number of dependencies that get installed with spaCy including:
<ul>
<li>catalogue</li>
<li>cymem</li>
<li>cython-blis</li>
<li>murmurhash</li>
<li>pathy</li>
<li>preshed</li>
<li>pydantic</li>
<li>shellingham</li>
<li>smart_open</li>
<li>spacy-legacy</li>
<li>srsly</li>
<li>thinc</li>
<li>typer</li>
<li>wasabi</li>
</ul>
</li>
<li>Downloading the model can take a bit of time, so be patient.</li>
<li>The large model is roughly 777 MB and the medium model is 48 MB. The difference is in the number of tokens included in the model.</li>
</ul>
</aside>
</section>
<section>
<h5>Installing Transformers</h5>
<pre data-id="code-animation" style="width: fit-content !important; font-size: 1.15rem !important;">
<code class="hljs" data-trim data-line-numbers="1-8|3-4,9-12"><script type="text/template">
# If TensorFlow 2.0 and/or PyTorch are already installed
$ pip install transformers
# For CPU support via PyTorch:
$ pip install transformers[torch]
# For CPU support via TensorFlow
$ pip install transformers[tf-cpu]
# To install with Flax
$ pip install transformers[flax]
# To install via conda
$ conda install -c huggingface transformers
# If you plan to use transformers, you may want to use this module as well
$ pip install simplerepresentations
</script></code></pre>
<aside class="notes">
<ul>
<li>I created/tested this on my 2013 MacBook Pro, so I went with conda.</li>
<li>This also involves installing the huggingface hub, protobuf, sacremoses, tokenizers, and typing-extensions packages.</li>
<li>Torch is roughly 128 MB in size.</li>
</ul>
</aside>
</section>
<section>
<h3>Get Stata's Python Interpreter Up and Running</h3>
<pre data-id="code-animation" style="width: fit-content !important; font-size: 1.15rem !important;"><code class="lang-python" data-trim>
# The examples that I'll talk through will use spaCy, but I've included an example
# that uses some transformers based models in the Jupyter notebook on GitHub
import json
import requests
import pandas as pd
from sfi import ValueLabel, Data, SFIToolkit
import spacy
import torch
torch.manual_seed(0)
# This will load the tokenizers and models using the BERT architecture
from transformers import BertTokenizer, BertModel
# This will initialize the tokenizer and download the pretrained model parameters
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case = False)
# We'll also load up the model for spaCy at this time
nlp = spacy.load('en_core_web_lg')
</code></pre>
<aside class="notes">
<ul>
<li>Also mention that the notebook is slightly different from what will be shown here due to the differences in the APIs.</li>
<li>Some of the deep learning models available from Huggingface yield vectors with thousands of dimensions</li>
<li>Even with only a few hundred dimensions and using spaCy, you are likely to run into some computing constraints.</li>
<li>If you have access to a server with a fair amount of RAM, that would be the best place to do some of this work and then you can use a local machine for model fitting.</li>
</ul>
</aside>
</section>
<section>
<h3>Get data from source</h3>
<pre data-id="code-animation" style="width: fit-content !important; font-size: 1.15rem !important; margin-left: -15% !important;">
<code class="lang-python" data-trim data-line-numbers="1-4|6-15|16-26">
# List of the URLs containing the data set
files = [ "https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/test.jsonl",
"https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/train.jsonl",
"https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/validation.jsonl" ]
# Function to handle dropping "variables" that prevent pandas from
# reading the JSON object
def normalizer(obs: dict, drop: list) -> pd.DataFrame:
# Loop over the "variables" to drop
for i in drop:
# Remove it from the dictionary object
del obs[i]
# Returns the Pandas dataframe
return pd.DataFrame.from_dict(obs)
# Object to store each of the data frames
data = []
# Loop over each of the files from the URLs above
for i in files:
# Get the raw content from the GitHub location
content = requests.get(i).content
# Split the JSON objects by new lines, pass each individual line to json.loads,
# pass the json.loads value to the normalizer function, and
# append the result to the data object defined outside of the loop
[ data.append(normalizer(json.loads(line), [ "players", "game_id" ])) for line in content.decode('utf-8').splitlines() ]
</code></pre>
</section>
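The download-and-normalize loop above can be exercised offline with a small inline sample. The two JSONL records below are made up to mimic the structure of the diplomacy data (the real normalizer drops the "players" and "game_id" keys, as on the slide); no network access or pandas is needed for this sketch.

```python
import json

# Two made-up JSONL records mimicking the shape of the diplomacy data
jsonl = "\n".join([
    '{"messages": ["hello"], "sender_labels": [true], "game_id": 1, "players": ["a", "b"]}',
    '{"messages": ["hi"], "sender_labels": [false], "game_id": 2, "players": ["c", "d"]}',
])

def normalizer(obs, drop):
    # Drop the keys whose values would prevent pandas from reading the record
    for key in drop:
        obs.pop(key, None)
    return obs

# JSONL = one JSON object per line, so split on newlines and parse each line
records = [normalizer(json.loads(line), ["players", "game_id"])
           for line in jsonl.splitlines()]
print(len(records))  # 2
```

Each cleaned record can then be handed to `pd.DataFrame.from_dict`, which is what the slide's `normalizer` does before appending to `data`.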
<section>
<h3>Prep Data for Stata</h3>
<pre data-id="code-animation" style="width: fit-content !important; font-size: 1.15rem !important; margin-left: -15% !important;">
<code class="lang-python" data-trim data-line-numbers="1-4|5-6|7-15|16-21|22-27">
# Define a couple data mappings for later use
labmap = { True: 1, False: 0, 'NOANNOTATION': -1 }
cntrys = { 'austria': 0, 'england': 1, 'france': 2, 'germany': 3, 'italy': 4, 'russia': 5, 'turkey': 6 }
seasons = { 'Fall': 0, 'Winter': 1, 'Spring': 2 }
# Combine the data frames for each game into one large dataset
dataset = pd.concat(data, axis = 0, join = 'inner', ignore_index = True, sort = False)
# Recast data to appropriate types
dataset['game_score'] = dataset['game_score'].astype('int')
dataset['sender_labels'] = dataset['sender_labels'].astype('int')
dataset['absolute_message_index'] = dataset['absolute_message_index'].astype('int')
dataset['relative_message_index'] = dataset['relative_message_index'].astype('int')
dataset['game_score_delta'] = dataset['game_score_delta'].astype('int')
dataset['years'] = dataset['years'].astype('int')
# Recode the text labels to numeric values
dataset.replace({'receiver_labels': labmap, 'speakers': cntrys, 'receivers': cntrys, 'seasons': seasons}, inplace = True)
# Create an indicator for when the receiver correctly identifies the truthfulness of the message
dataset['correct'] = (dataset['sender_labels'] == dataset['receiver_labels']).astype('int')
# Parse each message once with spaCy and store the Doc object in a new variable named token
dataset['token'] = dataset['messages'].apply(nlp)
# Get the number of tokens per message from the parsed Doc
dataset['tokens'] = dataset['token'].apply(len)
# Now the data set can be expanded by unique tokens
dataset = dataset.explode('token')
# Make sure the token variable is cast as a string
dataset['token'] = dataset['token'].astype('str')
# Then add IDs for each token
dataset['tokenid'] = dataset.groupby('messages').cumcount()
</code></pre>
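<p>The explode/cumcount pattern at the end can be tried on a toy data frame without spaCy; here <code>str.split</code> stands in for the tokenizer purely for illustration.</p>

```python
import pandas as pd

# Toy stand-in for the message data; str.split replaces spaCy's tokenizer
# purely to illustrate the explode/cumcount pattern.
df = pd.DataFrame({'messages': ['hello there', 'ok']})
df['token'] = df['messages'].apply(lambda x: x.split())
df['tokens'] = df['token'].apply(len)
# One row per token, with a within-message token ID
df = df.explode('token')
df['tokenid'] = df.groupby('messages').cumcount()
print(df[['messages', 'token', 'tokenid']].to_string(index=False))
```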
<aside class="notes">
<ul>
<li>Applying spaCy's nlp to the messages will take a while to execute, but it parses each message into individual words so the dataset can be expanded for each word in each message.</li>
<li>If you don't recast the token variable to a string, you will get an error when you try to load it into Stata.</li>
</ul>
</aside>
</section>
<section>
<h3>Load Data into Stata and Store Embeddings</h3>
<pre data-id="code-animation" style="width: fit-content !important; font-size: 1.15rem !important; margin-left: -15% !important;">
<code class="lang-python" data-trim data-line-numbers="1-6|7-18|19-30|31-40">
# Load the Stata sfi API classes used below
from sfi import Data, ValueLabel
# Get the names of the variables
varnms = dataset.columns
# Set the number of observations based on the messages column
Data.setObsTotal(len(dataset['messages']))
# Create the variables in Stata
for var in varnms:
    # The messages and token variables are both string types
    if var not in [ 'messages', 'token' ]:
        # Add the numeric variables to the data set
        Data.addVarLong(var)
    # Make the string variables strLs so there won't be any storage issues
    else:
        # Add a strL for each string variable
        Data.addVarStrL(var)
# Now push the data into Stata
Data.store(var = None, obs = None, val = dataset.values.tolist())
# Create a mapping of value labels to variables
vallabmap = { 'sender_labels' : labmap, 'receiver_labels': labmap,
              'seasons': seasons, 'speakers': cntrys, 'receivers': cntrys }
# Loop over the dictionary containing the value label mappings
for varnm, vallabs in vallabmap.items():
    # Create the value label
    ValueLabel.createLabel(varnm)
    # Iterate over the mappings and add each value/label pair to the value label
    for label, value in vallabs.items():
        ValueLabel.setLabelValue(varnm, value, str(label))
    # Then assign the value label to the variable
    ValueLabel.setVarValueLabel(varnm, varnm)
# Create the variables to store the dimensions of the word embeddings
for i in range(1, 301):
    Data.addVarDouble('wembed' + str(i))
# Get all of the tokens and include a sequence ID in the iteration
for ob, token in enumerate(dataset['token'].tolist()):
    # Get the spaCy embedding for this token
    embed = nlp(token)
    # Store the word vector for this token in the variables just created
    for dim in range(len(embed.vector)):
        Data.storeAt('wembed' + str(dim + 1), ob, embed.vector[dim])
</code></pre>
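<p>The notes mention that the Jupyter notebook instead built value labels by constructing command strings. A minimal sketch of that alternative, using a hypothetical helper (the `label_define` name is mine, not from the notebook):</p>

```python
# Hypothetical helper sketching the command-string alternative mentioned in
# the notes: build a `label define` command from a Python mapping, which
# could then be executed with SFIToolkit.stata().
labmap = { True: 1, False: 0, 'NOANNOTATION': -1 }

def label_define(name: str, mapping: dict) -> str:
    # Stata expects: label define name value "label" value "label" ...
    pairs = ' '.join(f'{value} "{label}"' for label, value in mapping.items())
    return f'label define {name} {pairs}'

print(label_define('sender_labels', labmap))
# label define sender_labels 1 "True" 0 "False" -1 "NOANNOTATION"
```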
<aside class="notes">
<ul>
<li>There are a few big differences between this script and the Jupyter Notebook.</li>
<li>While I could have taken the same approach to constructing the command string and executing it here as I did in the notebook, it was more efficient to build the value labels dynamically.</li>
<li>The biggest difference is that doing things this way it is possible to reduce memory overhead by working on a single observation at a time.</li>
</ul>
</aside>
</section>
<section>
<h3>Fit a Model and Get Document/Message Embeddings Instead</h3>
<pre data-id="code-animation" style="width: fit-content !important; font-size: 1.15rem !important; margin-left: -15% !important;">
<code class="lang-python" data-trim data-line-numbers="1-5|6-7|8-10|11-17|18-19">
# You can now fit a model to the data:
SFIToolkit.stata("logit correct i.speakers i.seasons i.years i.game_score wembed1-wembed300")
# These results are fairly noisy, so there may be better luck using document vectors
SFIToolkit.stata("drop token tokenid wembed*")
SFIToolkit.stata("duplicates drop")
# Mirror the Stata changes in the pandas dataset so the observations line up
dataset = dataset.drop(columns = [ 'token', 'tokenid' ]).drop_duplicates(ignore_index = True)
# Now use the same process as above, but with document vectors
for i in range(1, 301):
    Data.addVarDouble('docembed' + str(i))
# Then iterate over the messages (instead of individual tokens)
for ob, message in enumerate(dataset['messages'].tolist()):
    # Get the spaCy embedding for the message
    embed = nlp(message)
    # Store the document/message/sentence embedding for this record
    for dim in range(len(embed.vector)):
        Data.storeAt('docembed' + str(dim + 1), ob, embed.vector[dim])
# This model fits the data a bit better than the previous one and is also noticeably faster
SFIToolkit.stata("logit correct i.speakers i.seasons i.years i.game_score docembed1-docembed300")
</code></pre>
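<p>spaCy's Doc.vector for a model with word vectors is simply the average of the token vectors, which is why switching to document embeddings needed no new machinery. A minimal numpy sketch of that mean pooling, with random stand-in vectors so it is self-contained:</p>

```python
import numpy as np

# spaCy's Doc.vector (for vocabulary-vector models) is the mean of the
# token vectors; random stand-ins keep this sketch self-contained.
rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(4, 300))   # 4 "tokens", 300 dimensions
doc_vector = token_vectors.mean(axis=0)     # one 300-dimensional embedding
print(doc_vector.shape)  # (300,)
```

Whatever the message length, the pooled result has the same 300 dimensions, so the same docembed1-docembed300 variables work for every record.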
</section>
</section>
<section>
<section data-autoslide="4500">
<h2>Wrapping Up</h2>
</section>
<section>
<ul>
<li>Be mindful of compute resource consumption and availability.</li>
<li>The Python API and pystata have different functionality.</li>
<li>Look up information about available models and their training contexts.</li>
<li>You may need to train the model on your data for it to produce informative embeddings.</li>
</ul>
<aside class="notes">
<ul>
<li>The Python API will provide a bit more flexibility with compute consumption by allowing you to work in something analogous to a streaming interface (e.g., streaming observations).</li>
<li>If you have substantial compute resources available you may be able to do everything in larger batches and can use notebooks effectively there as well.</li>
<li>The models in the transformers library all return embeddings with different dimensions.</li>
<li>Aside from an awareness of variable limits in Stata, you should also think about how the additional dimensions affect computational performance.</li>
<li>More importantly, there are highly context-specific models developed and shared openly that can provide a reasonable starting point (e.g., SciBERT); a lot of this work is being done in the medical field with electronic health records.</li>
<li>If you need to fine tune or train the last layer or two of a pre-trained model, it may be better to manage that workflow largely in Python to avoid any additional competition for computing resources.</li>
</ul>
</aside>
</section>
<section>
<img src="https://www.dur.ac.uk/images/geography/staff/cox.jpg" alt="Image of Nicholas J Cox">
<blockquote cite="https://www.dur.ac.uk/directory/profile/?id=335">
"It's always good to end with a slogan."
- Nicholas J. Cox,
North American Stata Users Group Conference 2021
</blockquote>
</section>
</section>
</div>
</div>
<script src="dist/reveal.js"></script>
<script src="plugin/zoom/zoom.js"></script>
<script src="plugin/notes/notes.js"></script>
<script src="plugin/search/search.js"></script>
<script src="plugin/markdown/markdown.js"></script>
<script src="plugin/highlight/highlight.js"></script>
<script>
// Also available as an ES module, see:
// https://revealjs.com/initialization/
Reveal.initialize({
controls: true,
progress: true,
center: true,
hash: true,
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealZoom, RevealNotes, RevealSearch, RevealMarkdown, RevealHighlight ]
});
</script>
</body>
</html>