/
text_hashing_trick.qmd
50 lines (35 loc) · 1.43 KB
/
text_hashing_trick.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
---
execute:
freeze: true
---
[R/preprocessing.R](https://github.com/rstudio/keras//blob/main/R/preprocessing.R#L234)
# text_hashing_trick
## Converts a text to a sequence of indexes in a fixed-size hashing space.
## Description
Converts a text to a sequence of indexes in a fixed-size hashing space.
## Usage
```r
text_hashing_trick(
text,
n,
hash_function = NULL,
filters = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n",
lower = TRUE,
split = " "
)
```
## Arguments
|Arguments|Description|
|---|---|
| text | Input text (string). |
| n | Dimension of the hashing space. |
| hash_function | if `NULL` uses the Python `hash()` function. Otherwise can be `'md5'` or any function that takes in input a string and returns an int. Note that `hash` is not a stable hashing function, so it is not consistent across different runs, while `'md5'` is a stable hashing function. |
| filters | Sequence of characters to filter out such as punctuation. Default includes basic punctuation, tabs, and newlines. |
| lower | Whether to convert the input to lowercase. |
| split | Sentence split marker (string). |
## Details
Two or more words may be assigned to the same index, due to possible collisions by the hashing function.
## Value
A list of integer word indices (unicity non-guaranteed).
## See Also
Other text preprocessing: `make_sampling_table()`, `pad_sequences()`, `skipgrams()`, `text_one_hot()`, `text_to_word_sequence()`