textrecipes

textrecipes contains extra steps for the recipes package for preprocessing text data.

Installation

textrecipes is not avaliable from CRAN yet. But the development version can be downloaded with:

require("devtools")
install_github("emilhvitfeldt/textrecipes")

Example

In the following example we will go through the steps needed to convert a character variable to the TF-IDF of its tokenized words after removing stopwords and limeting ourself to only the 500 most used words. We will be conduction this preprosession on the variable essay0.

library(recipes)
library(textrecipes)

data(okc_text)

okc_rec <- recipe(~ ., data = okc_text) %>%
  add_role(contains("essay"), new_role = "textual") %>%
  step_tokenize(has_role("textual")) %>% # Tokenizes to words by default
  step_stopwords(has_role("textual")) %>% # Uses the english snowball list by default
  step_tokenfilter(has_role("textual"), max_tokens = 100) %>%
  step_tfidf(has_role("textual"))
#> Warning: Changing role(s) for essay0, essay1, essay2, essay3, essay4,
#> essay5, essay6, essay7, essay8, essay9
   
okc_obj <- okc_rec %>%
  prep(training = okc_text)
   
str(bake(okc_obj, okc_text), list.len = 15)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    750 obs. of  1000 variables:
#>  $ tfidf_essay0_also         : num  0 0 0.0213 0.1888 0 ...
#>  $ tfidf_essay0_always       : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_amp          : num  0.457 0.567 0 0 0 ...
#>  $ tfidf_essay0_anything     : num  0 0 0.108 0 0 ...
#>  $ tfidf_essay0_area         : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_around       : num  0 0 0.0327 0 0 ...
#>  $ tfidf_essay0_art          : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_back         : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_bay          : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_believe      : num  0 0 0 0 0.302 ...
#>  $ tfidf_essay0_big          : num  0.0747 0 0 0 0 ...
#>  $ tfidf_essay0_bit          : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_essay0_br           : num  0.0573 0.2665 0.0573 0 0 ...
#>  $ tfidf_essay0_can          : num  0.0406 0 0.0203 0 0 ...
#>  $ tfidf_essay0_city         : num  0 0 0 0 0 0 0 0 0 0 ...
#>   [list output truncated]

Type chart

textrecipes includes a little departure in design from recipes in the sense that it allows some input and output to be in the form of list columns. To avoind confusion here is a table of steps with their expected input and output respectively. Notice how you need to end with numeric for future analysis to work.

Step	Input	Output	Status
`step_tokenize`	character	list-column	Done
`step_untokenize`	list-column	character	Done
`step_stem`	list-column	list-column	Done
`step_stopwords`	list-column	list-column	Done
`step_tokenfilter`	list-column	list-column	Done
`step_tfidf`	list-column	numeric	Done
`step_tf`	list-column	numeric	Done
`step_texthash`	list-column	numeric	Done
`step_word2vec`	character	numeric	TODO

(TODO = Yet to be implemented, bug = correnctly not working, working = the step works but still not finished i.e. missing document/tests/arguemnts, done = finished)

This means that valid sequences includes

recipe(~ ., data = data) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_stopwords(text) %>%
  step_topwords(text) %>%
  step_tf(text)

# or

recipe(~ ., data = data) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_tfidf(text)

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
R		R
data		data
man		man
tests		tests
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
.travis.yml		.travis.yml
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.Rmd		README.Rmd
README.md		README.md
codecov.yml		codecov.yml
textrecipes.Rproj		textrecipes.Rproj

License

Licenses found

topepo/textrecipes

Folders and files

Latest commit

History

Repository files navigation

textrecipes

Installation

Example

Type chart

About

Resources

License

Licenses found

Stars

Watchers

Forks

Languages