---
title: "Replication: word embedding (gloVe/word2vec)"
author: Kenneth Benoit, Haiyan Wang and Kohei Watanabe
output:
html_document:
toc: true
---
```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = FALSE, comment = "##")
```
```{r, message = FALSE}
library("quanteda")
```
This is intended to show how **quanteda** can be used with the [**text2vec**](http://text2vec.org) package to replicate its [GloVe example](http://text2vec.org/glove.html).
## Download the corpus
Download a corpus comprising the texts used in the [**text2vec** vignette](http://text2vec.org/glove.html):
```{r eval = TRUE}
wiki_corp <- quanteda.corpora::download(url = "https://www.dropbox.com/s/9mubqwpgls3qi9t/data_corpus_wiki.rds?dl=1")
```
## Select features
First, we tokenize the corpus and then get the names of the features that occur five times or more. We trim the feature set before constructing the fcm:
```{r}
wiki_toks <- tokens(wiki_corp)
feats <- dfm(wiki_toks, verbose = TRUE) |>
    dfm_trim(min_termfreq = 5) |>
    featnames()
# leave the pads so that non-adjacent words will not become adjacent
wiki_toks <- tokens_select(wiki_toks, feats, padding = TRUE)
```
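To see what the padding does, here is a small toy illustration (not part of the replication): removed tokens are replaced by empty pads, so the remaining words keep their original positions and do not become adjacent.
```{r}
# hypothetical toy example: "brown" and "jumps" are dropped but leave pads,
# so "quick" and "fox" remain two positions apart
toks_toy <- tokens("the quick brown fox jumps")
tokens_select(toks_toy, c("quick", "fox"), padding = TRUE)
```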
## Construct the feature co-occurrence matrix
```{r}
wiki_fcm <- fcm(wiki_toks, context = "window", count = "weighted", weights = 1 / (1:5), tri = TRUE)
```
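With `count = "weighted"`, co-occurrences within the five-word window are down-weighted by distance (1, 1/2, ..., 1/5). A toy illustration (again, not part of the replication) using the same settings:
```{r}
# hypothetical toy example: adjacent pairs get weight 1, more distant pairs less
fcm(tokens("a b c d e"), context = "window", count = "weighted",
    weights = 1 / (1:5))
```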
## Fit word embedding model
Fit the [GloVe model](https://nlp.stanford.edu/pubs/glove.pdf) using the [**text2vec**](http://text2vec.org) package, whose `GlobalVectors` class wraps the **rsparse** implementation.
```{r, message = FALSE}
library("text2vec")
```
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
GloVe encodes ratios of word-word co-occurrence probabilities, which are thought to capture a crude form of word meaning, as vector differences. The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence.
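Stated as a formula (this is the objective from the GloVe paper, not something computed directly in this vignette), the model minimises

$$
J = \sum_{i, j = 1}^{V} f(X_{ij}) \left( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 ,
$$

where $X_{ij}$ is the (weighted) co-occurrence count for words $i$ and $j$, $\mathbf{w}_i$ and $\tilde{\mathbf{w}}_j$ are the main and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that caps the influence of very frequent co-occurrences (controlled by the `x_max` argument below).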
```{r, cache=TRUE}
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(wiki_fcm, n_iter = 10,
                               convergence_tol = 0.01, n_threads = 8)
dim(wv_main)
```
## Averaging learned word vectors
The model learns two sets of word vectors: main and context. According to the GloVe paper, averaging (or, as here, summing) the main and context vectors results in a more accurate representation.
```{r}
wv_context <- glove$components
dim(wv_context)
word_vectors <- wv_main + t(wv_context)
```
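As an optional check, the combined matrix has one row per feature and `rank` columns; its rows can be indexed by word, which is what the analogy examples below rely on.
```{r}
dim(word_vectors)
```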
## Examining term representations
Now we can find the word vectors closest to `paris - france + germany`:
```{r}
berlin <- word_vectors["paris", , drop = FALSE] -
    word_vectors["france", , drop = FALSE] +
    word_vectors["germany", , drop = FALSE]
library("quanteda.textstats")
cos_sim <- textstat_simil(x = as.dfm(word_vectors), y = as.dfm(berlin),
                          method = "cosine")
head(sort(cos_sim[, 1], decreasing = TRUE), 5)
```
Here is another example, for `london = paris - france + uk + england`:
```{r}
london <- word_vectors["paris", , drop = FALSE] -
    word_vectors["france", , drop = FALSE] +
    word_vectors["uk", , drop = FALSE] +
    word_vectors["england", , drop = FALSE]
cos_sim <- textstat_simil(as.dfm(word_vectors), y = as.dfm(london),
                          margin = "documents", method = "cosine")
head(sort(cos_sim[, 1], decreasing = TRUE), 5)
```