Permalink
Cannot retrieve contributors at this time
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
113 lines (78 sloc)
3.31 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Working with multi-word expressions" | |
author: "Kohei Watanabe" | |
output: | |
html_document: | |
toc: true | |
--- | |
```{r, echo = FALSE} | |
knitr::opts_chunk$set( | |
collapse = FALSE, | |
comment = "##" | |
) | |
``` | |
**quanteda** has the functionality to select, remove or compound multi-word expressions, such as phrasal verbs ("try on", "wake up" etc.) and place names ("New York", "South Korea" etc.). | |
```{r, message = FALSE} | |
library(quanteda) | |
``` | |
# Tokenize | |
```{r} | |
toks <- tokens(data_corpus_inaugural) | |
``` | |
# Define multi-word expressions | |
Functions for tokens objects take a character vector, a dictionary or collocations as `pattern`. All those three can be used for multi-word expressions, but you have to be aware their differences. | |
## Character vector | |
The most basic way to define multi-word expressions is separating words by whitespaces and wrap the character vector by `phrase()`. | |
```{r} | |
multiword <- c("United States", "New York") | |
``` | |
### Keyword-in-context | |
`kwic()` is useful to find multi-word expressions in tokens. If you are not sure if "United" and "States" are separated, check their positions (e.g. "434:435"). | |
```{r} | |
head(kwic(toks, pattern = phrase(multiword))) | |
``` | |
### Select tokens | |
Similarly, you can select or remove multi-word expression using `tokens_select()`. | |
```{r} | |
head(tokens_select(toks, pattern = phrase(multiword))) | |
``` | |
### Compound tokens | |
`tokens_compound()` joins elements of multi-word expressions by underscore, so they become "United_States" and "New_York". | |
```{r} | |
comp_toks <- tokens_compound(toks, pattern = phrase(multiword)) | |
head(tokens_select(comp_toks, pattern = c("United_States", "New_York"))) | |
``` | |
## Dictionary | |
Elements of multi-word expressions should be separately by whitespaces in a dictionary, but you do not use `phrase()` here. | |
```{r} | |
multiword_dict <- dictionary(list(country = "United States", | |
city = "New York")) | |
``` | |
### Lookup dictionary | |
```{r} | |
head(tokens_lookup(toks, dictionary = multiword_dict)) | |
``` | |
## Collocations | |
With `textstat_collocations()`, it is possible to discover multi-word expressions through statistical scoring of the associations of adjacent words. | |
### Discover collocations | |
If `textstat_collocations()` is applied to a tokens object comprised only of capitalize words, it usually returns multi-word proper names. | |
```{r} | |
library("quanteda.textstats") | |
col <- toks %>% | |
tokens_remove(stopwords("en")) %>% | |
tokens_select(pattern = "^[A-Z]", valuetype = "regex", | |
case_insensitive = FALSE, padding = TRUE) %>% | |
textstat_collocations(min_count = 5, tolower = FALSE) | |
head(col) | |
``` | |
### Compound collocations | |
Collocations are automatically recognized as multi-word expressions by `tokens_compound()` in *case-sensitive fixed pattern matching*. This is the fastest way to compound large numbers of multi-word expressions, but make sure that `tolower = FALSE` in `textstat_collocations()` to do this. | |
```{r} | |
comp_toks2 <- tokens_compound(toks, pattern = col) | |
head(kwic(comp_toks2, pattern = c("United_States", "New_York"))) | |
``` | |
You can use `phrase()` on collocations if more flexibility is needed. This is usually the case if you compound tokens from different corpus. | |
```{r} | |
comp_toks3 <- tokens_compound(toks, pattern = phrase(col$collocation)) | |
head(kwic(comp_toks3, pattern = c("United_States", "New_York"))) | |
``` | |