-
Notifications
You must be signed in to change notification settings - Fork 0
/
vectorspace-intro.qmd
406 lines (342 loc) · 18.6 KB
/
vectorspace-intro.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
# Introduction to Vector Space {#sec-vectorspace-intro}
```{r setup}
#| echo: false
#| include: false
source("_common.R")
```
Thus far we have covered various forms of counting. But more advanced methods in NLP often rely on *comparing* instead. To understand these methods, we must get comfortable with the idea of vector space.
This chapter is a basic introduction to the concept of representing documents as vectors. We also introduce two basic vector-based measurement techniques: Euclidean distance and cosine similarity. A more advanced and in-depth guide to navigating vector space will be covered in @sec-navigating-vectorspace.
------------------------------------------------------------------------
A fictional example[^vectorspace-intro-1]: Daniel and Amos filled out a psychology questionnaire. The questionnaire measured three aspects of their personalities: extraversion, openness to experience, and neuroticism.
[^vectorspace-intro-1]: This section is adapted from @alammar_2019
```{r}
#| echo: false
personality <- tribble(
~person, ~extraversion, ~openness, ~neuroticism,
"Daniel", 4, 6, 5,
"Amos", 2, 7, 3
)
```
On the extraversion scale, Amos scored a 2 and Daniel scored a 4:
```{r}
#| echo: false
personality |>
ggplot(aes(1, extraversion, fill = extraversion)) +
# axis scale
annotate("segment",
x = c(1, rep.int(.9, 7)), xend = c(1, rep.int(1.1, 7)),
y = c(1, 1:7), yend = c(7, 1:7)) +
geom_point(size = 5, shape = 21) +
scale_y_continuous(breaks = 1:7, limits = c(-1.5,7.5)) +
# number tiles
annotate("tile", color = "black",
x = 1:3, y = -1, width = .9, height = .9) +
geom_tile(aes(x = 1, y = -1), width = .8, height = .8) +
geom_text(aes(x = 1, y = -1, label = extraversion), size = 5) +
# colors
scale_fill_gradientn(limits = c(1,7), guide = "none",
colours = hcl.colors(7, palette = "Blue-Red")) +
# general
coord_equal() +
facet_wrap(~person) +
labs(x = "", y = "Extraversion") +
theme_void() +
theme(panel.grid.major.y = element_line(color = "#E5E5E5"),
axis.text.y = element_text(),
axis.title = element_text(angle = 90, hjust = .67, size = 15,
margin = margin(r = 15)),
strip.text = element_text(size = 20))
```
On the openness scale, Amos scored a 7 and Daniel scored a 6:
```{r}
#| echo: false
personality |>
ggplot(aes(2, openness, fill = openness)) +
# axis scale
annotate("segment",
x = c(2, rep.int(1.9, 7)), xend = c(2, rep.int(2.1, 7)),
y = c(1, 1:7), yend = c(7, 1:7)) +
geom_point(size = 5, shape = 21) +
scale_y_continuous(breaks = 1:7, limits = c(-1.5,7.5)) +
# number tiles
annotate("tile", color = "black",
x = 1:3, y = -1, width = .9, height = .9) +
geom_tile(aes(x = 1, y = -1, fill = extraversion), width = .8, height = .8) +
geom_tile(aes(x = 2, y = -1), width = .8, height = .8) +
geom_text(aes(x = 1, y = -1, label = extraversion), size = 5) +
geom_text(aes(x = 2, y = -1, label = openness), size = 5) +
# colors
scale_fill_gradientn(limits = c(1,7), guide = "none",
colours = hcl.colors(7, palette = "Blue-Red")) +
# general
coord_equal() +
facet_wrap(~person) +
labs(x = "", y = "Openness") +
theme_void() +
theme(panel.grid.major.y = element_line(color = "#E5E5E5"),
axis.text.y = element_text(),
axis.title = element_text(angle = 90, hjust = .67, size = 15,
margin = margin(r = 15)),
strip.text = element_text(size = 20))
```
On the neuroticism scale, Amos scored a 3 and Daniel scored a 5.
```{r}
#| echo: false
personality |>
ggplot(aes(3, neuroticism, fill = neuroticism)) +
# axis scale
annotate("segment",
x = c(3, rep.int(2.9, 7)), xend = c(3, rep.int(3.1, 7)),
y = c(1, 1:7), yend = c(7, 1:7)) +
geom_point(size = 5, shape = 21) +
scale_y_continuous(breaks = 1:7, limits = c(-1.5,7.5)) +
# number tiles
annotate("tile", color = "black",
x = 1:3, y = -1, width = .9, height = .9) +
geom_tile(aes(x = 1, y = -1, fill = extraversion), width = .8, height = .8) +
geom_tile(aes(x = 2, y = -1, fill = openness), width = .8, height = .8) +
geom_tile(aes(x = 3, y = -1), width = .8, height = .8) +
geom_text(aes(x = 1, y = -1, label = extraversion), size = 5) +
geom_text(aes(x = 2, y = -1, label = openness), size = 5) +
geom_text(aes(x = 3, y = -1, label = neuroticism), size = 5) +
# colors
scale_fill_gradientn(limits = c(1,7), guide = "none",
colours = hcl.colors(7, palette = "Blue-Red")) +
# general
coord_equal() +
facet_wrap(~person) +
labs(x = "", y = "Neuroticism") +
theme_void() +
theme(panel.grid.major.y = element_line(color = "#E5E5E5"),
axis.text.y = element_text(),
axis.title = element_text(angle = 90, hjust = .67, size = 15,
margin = margin(r = 15)),
strip.text = element_text(size = 20))
```
We can now represent each person's personality as a list of three numbers, or a *three dimensional vector*. We can graph these vectors in three dimensional vector space:
```{r}
#| echo: false
#| fig-width: 6
#| fig-height: 1
personality |>
ggplot(aes(3, neuroticism, fill = neuroticism)) +
# number tiles
annotate("tile", color = "black",
x = 1:3, y = -1, width = .9, height = .9) +
geom_tile(aes(x = 1, y = -1, fill = extraversion), width = .8, height = .8) +
geom_tile(aes(x = 2, y = -1, fill = openness), width = .8, height = .8) +
geom_tile(aes(x = 3, y = -1), width = .8, height = .8) +
geom_text(aes(x = 1, y = -1, label = extraversion), size = 5) +
geom_text(aes(x = 2, y = -1, label = openness), size = 5) +
geom_text(aes(x = 3, y = -1, label = neuroticism), size = 5) +
# colors
scale_fill_gradientn(limits = c(1,7), guide = "none",
colours = hcl.colors(7, palette = "Blue-Red")) +
# general
coord_equal() +
facet_wrap(~person) +
labs(x = "", y = "") +
theme_void() +
theme(axis.title = element_text(angle = 90, hjust = .67, size = 15,
margin = margin(r = 15)),
strip.text = element_text(size = 20))
```
```{r}
#| echo: false
#| warning: false
library(plotly)
personality |>
plot_ly(x = ~extraversion,
y = ~openness,
z = ~neuroticism,
split = ~person) |>
add_markers() |>
add_text(text = ~person) |>
add_paths(
data = personality |>
bind_rows(mutate(personality,
extraversion = 0,
openness = 0,
neuroticism = 0)) |>
group2NA("person")) |>
layout(xaxis = list(zerolinecolor = 'black',
zerolinewidth = 10),
showlegend = FALSE)
```
Now imagine that we encounter a third person, Elizabeth. We would like to know whether Elizabeth is more similar to Daniel or to Amos.
```{r}
#| echo: false
#| warning: false
personality <- personality |>
add_row(person = "Elizabeth",
extraversion = 8,
openness = 4,
neuroticism = 6)
personality |>
rename(Dim1 = extraversion, Dim2 = openness, Dim3 = neuroticism) |>
plot_ly(x = ~Dim1,
y = ~Dim2,
z = ~Dim3,
split = ~person) |>
add_markers() |>
add_text(text = ~person) |>
add_paths(
data = personality |>
rename(Dim1 = extraversion, Dim2 = openness, Dim3 = neuroticism) |>
bind_rows(mutate(personality,
Dim1 = 0,
Dim2 = 0,
Dim3 = 0)) |>
group2NA("person")) |>
layout(xaxis = list(zerolinecolor = 'black',
zerolinewidth = 10),
showlegend = FALSE)
```
After graphing all three people in three-dimensional vector space, it becomes obvious that Elizabeth is more similar to Daniel than she is to Amos. Thinking of people (or any sort of observations) as vectors is powerful because it allows us to apply geometric reasoning to data. The beauty of this approach is that we can measure these things without knowing anything about what the dimensions represent. This will be important later.
Thus far we have discussed three-dimensional vector space. But what if we want to measure personality with the full Big-Five traits---openness, conscientiousness, extraversion, agreeableness, and neuroticism? Five dimensions would make it impossible to graph the data in an intuitive way as we have done above, but in a mathematical sense, it doesn't matter. We can measure distance---and many other geometric concepts---just as easily in five-dimensional vector space as in three dimensions.
## Distance and Similarity
When we added Elizabeth to the graph above, we could tell that she was more similar to Daniel than to Amos just by looking at the graph. But how do we quantify this similarity or difference?
### Euclidean Distance {#sec-euclidean-distance}
The most straightforward way to measure the similarity between two points in space is to measure the distance between them. *Euclidean distance* is the simplest sort of distance---the length of the shortest straight line between the two points. The Euclidean distance between two vectors $A$ and $B$ can be calculated in any number of dimensions $n$ using the following formula:
$$
d\left( A,B\right) = \sqrt {\sum _{i=1}^{n} \left( A_{i}-B_{i}\right)^2 }
$$
**A low Euclidean distance means two vectors are very similar**. Let's calculate the Euclidean distance between Daniel and Elizabeth, and between Amos and Elizabeth:
```{r}
# dataset
personality
# Elizabeth's vector
eliza_vec <- personality |>
filter(person == "Elizabeth") |>
select(extraversion:neuroticism) |>
as.numeric()
# Euclidean distance function
euc_dist <- function(x, y){
diff <- x - y
sqrt(sum(diff^2))
}
# distance between Elizabeth and each person
personality_dist <- personality |>
rowwise() |>
mutate(
dist_from_eliza = euc_dist(c_across(extraversion:neuroticism), eliza_vec)
)
personality_dist
```
We now see that the closest person to Elizabeth is... Elizabeth herself, with a distance of 0. After that, the closest is Daniel. So we can conclude that Daniel has a more Elizabeth-like personality than Amos does.
### Cosine Similarity {#sec-cosine-similarity}
Besides Euclidean distance, the most common way to measure the similarity between two vectors is with cosine similarity. This is the cosine of the angle between the two vectors. Since the cosine of 0 is 1, **a high cosine similarity (close to 1) means two vectors are very similar**.
```{r}
#| echo: false
#| warning: false
library(patchwork)
cos_plot <- personality |>
filter(person != "Daniel") |>
mutate(person = if_else(person == "Elizabeth", "Vector A", "Vector B")) |>
ggplot() +
geom_segment(aes(openness, neuroticism, xend = 0, yend = 0)) +
ggforce::geom_arc(
aes(x0 = 0, y0 = 0, r = 4, start = 1.1659, end = 0.588),
linewidth = 2, color = "red4", arrow = arrow(ends = "both")
) +
annotate("text", label = "cos(θ) = 0.84",
x = 4.5, y = 3.1, color = "red4", fontface = "bold") +
geom_point(aes(openness, neuroticism, color = person), size = 4) +
geom_text(aes(openness + .5, neuroticism + .5, label = person)) +
scale_color_manual(
breaks = c("Vector A", "Vector B"),
values = c("forestgreen", "purple")
) +
guides(color = "none") +
coord_fixed(xlim = c(0, 8), ylim = c(0, 7)) +
labs(title = "Cosine Similarity") +
theme_minimal() +
theme(plot.title = element_text(size = 30, hjust = .5, color = "red4"))
euc_plot <- personality |>
filter(person != "Daniel") |>
mutate(person = if_else(person == "Elizabeth", "Vector A", "Vector B")) |>
ggplot() +
# geom_segment(aes(openness, neuroticism, xend = 0, yend = 0)) +
geom_point(aes(openness, neuroticism, color = person), size = 4) +
geom_segment(
aes(openness, neuroticism, xend = lead(openness), yend = lead(neuroticism)),
linewidth = 2, color = "royalblue", arrow = arrow(ends = "both")
) +
annotate("text", label = "4.24",
x = 5, y = 4.1, color = "royalblue", fontface = "bold") +
geom_text(aes(openness + .7, neuroticism + .5, label = person)) +
scale_color_manual(
breaks = c("Vector A", "Vector B"),
values = c("forestgreen", "purple")
) +
guides(color = "none") +
coord_fixed(xlim = c(0, 8), ylim = c(0, 7)) +
labs(title = "Euclidean Distance") +
theme_minimal() +
theme(plot.title = element_text(size = 30, hjust = .5, color = "royalblue"))
euc_plot + cos_plot
```
A nice thing about the cosine is that it is always between -1 and 1: When the two vectors are pointing in a similar direction, the cosine is close to 1, and when they are pointing in a near-opposite direction (180°), the cosine is close to -1.
Looking at the above visualization, you might wonder: Why should the angle be fixed at the zero point? What does the zero point have to do with anything? If you wondered this, good job. The reason: **Cosine similarity works best when your vector space is centered at zero (or close to it)**. In other words, it works best when zero represents a medium level of each variable. This fact is sometimes taken for granted because, in practice, many vector spaces are already centered at zero. For example, word embeddings trained with word2vec, GloVe, and related models (@sec-word-embeddings) can be assumed to center at zero given sufficiently diverse training data because their training is based on the dot products between embeddings (the dot product is a close cousin of cosine similarity). The ubiquity of zero-centered vector spaces makes cosine similarity a very useful tool. Even so, not all vector spaces are zero-centered, so take a moment to consider the nature of your vector space before deciding which similarity or distance metric to use.
The formula for calculating cosine similarity might look a bit complicated:
$$
Cosine(A,B) = \frac{A \cdot B}{|A||B|} = \frac{\sum _{i=1}^{n} A_{i}B_{i}}{\sqrt {\sum _{i=1}^{n} A_{i}^2} \cdot \sqrt {\sum _{i=1}^{n} B_{i}^2}}
$$
In R though, it's pretty simple. Let's calculate the cosine similarity between Elizabeth and each of the other people in our sample. To make sure the vector space is centered at zero, we will subtract 4 from each value (the scales all range from 1 to 7).
```{r}
# cosine similarity function
cos_sim <- function(x, y){
dot <- x %*% y
normx <- sqrt(sum(x^2))
normy <- sqrt(sum(y^2))
as.vector( dot / (normx*normy) )
}
# center at 0
eliza_vec_centered <- eliza_vec - 4
personality_sim <- personality |>
mutate(across(extraversion:neuroticism, ~.x - 4))
# distance between Elizabeth and each person
personality_sim <- personality_sim |>
rowwise() |>
mutate(
similarity_to_eliza = cos_sim(c_across(extraversion:neuroticism), eliza_vec_centered)
)
personality_sim
```
Once again, we see that the most similar person to Elizabeth is Elizabeth herself, with a cosine similarity of 1. The next closest, as before, is Daniel.
If you are comfortable with cosines, you might be happy with the explanation we have given so far. Nevertheless, it might be helpful to consider the relationship between cosine similarity and a more familiar statistic that ranges between -1 and 1: the Pearson correlation coefficient (i.e. regular old correlation). Cosine similarity measures the similarity between two _vectors_, while the correlation coefficient measures the similarity between two _variables_. Now just imagine our vectors as variables, with each dimension as an observation. Since we only compare two vectors at a time with cosine similarity, let's start with Elizabeth and Amos:
```{r}
#| echo: false
#| warning: false
personality |>
filter(person != "Daniel") |>
pivot_longer(extraversion:neuroticism, names_to = "dimension") |>
pivot_wider(names_from = "person") |>
ggplot(aes(Amos, Elizabeth, label = dimension)) +
geom_point(size = 3) +
geom_text(aes(Amos + .3, Elizabeth + .3)) +
coord_equal(xlim = c(1, 8), ylim = c(4, 9)) +
theme_minimal()
```
Now imagine centering those variables at zero, like this:
```{r}
#| echo: false
#| warning: false
personality |>
filter(person != "Daniel") |>
pivot_longer(extraversion:neuroticism, names_to = "dimension") |>
pivot_wider(names_from = "person") |>
mutate(across(Amos:Elizabeth, ~.x - mean(.x))) |>
ggplot(aes(Amos, Elizabeth, label = dimension)) +
geom_hline(yintercept = 0, color = "grey") +
geom_vline(xintercept = 0, color = "grey") +
geom_point(size = 3) +
geom_text(aes(Amos + .3, Elizabeth + .3)) +
coord_equal(xlim = c(-4, 4), ylim = c(-3, 3)) +
theme_minimal()
```
When seen like this, the correlation is the same as the cosine similarity. In other words, **the correlation between two vectors is the same as the cosine similarity between them when the values of each vector are centered at zero**.^[For proof, see @oconnor_2012] Seeing cosine similarity as the non-centered version of correlation might give you extra intuition for why cosine similarity works best for vector spaces that are centered at zero.
## Word Counts as Vector Space {#sec-word-count-vectors}
The advantage of thinking in vector space is that we can quantify similarities and differences even without understanding what any of the dimensions in the vector space are measuring. In the coming chapters, we will introduce methods that require this kind of relational thinking, since the dimensions of the vector space are abstract statistical contrivances. Even so, any collection of variables can be thought of as dimensions in a vector space. You might, for example, use distance or similarity metrics to analyze groups of word counts.
**An example of word counts as relational vectors in research:** @ireland_pennebaker_2010 asked students to answer essay questions written in different styles. They then calculated dictionary-based word counts for both the questions and the answers using 9 linguistic word lists from LIWC (see @sec-dictionary-sources), including personal pronouns (e.g. "I", "you"), and articles (e.g, "a", "the"). They treated these 9 word counts as a 9-dimensional vector for each text, and measured the similarity between questions and responses with a metric similar to Euclidean distance. They found that students automatically matched the linguistic style of the questions (i.e. answers were more similar to the question they were answering than to other questions) and that women and students with higher grades matched their answers especially closely to the style of the questions.
---