textrecipes


textrecipes contains extra steps for the recipes package for preprocessing text data.

Installation

textrecipes is not yet available from CRAN, but the development version can be installed with:

# install.packages("devtools")
devtools::install_github("emilhvitfeldt/textrecipes")

Example

In the following example we go through the steps needed to convert character variables to the TF-IDF of their tokenized words, after removing stopwords and limiting ourselves to only the 100 most used words. We will apply this preprocessing to the essay variables (essay0 through essay9).

library(recipes)
library(textrecipes)

data(okc_text)

okc_rec <- recipe(~ ., data = okc_text) %>%
  add_role(contains("essay"), new_role = "textual") %>%
  step_tokenize(has_role("textual")) %>% # Tokenizes to words by default
  step_stopwords(has_role("textual")) %>% # Uses the English Snowball list by default
  step_tokenfilter(has_role("textual"), max_tokens = 100) %>%
  step_tfidf(has_role("textual"))
#> Warning: Changing role(s) for essay0, essay1, essay2, essay3, essay4,
#> essay5, essay6, essay7, essay8, essay9
   
okc_obj <- okc_rec %>%
  prep(training = okc_text)
   
str(bake(okc_obj, okc_text), list.len = 15)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    750 obs. of  1000 variables:
#>  $ tfidf_essay0_also         : num  0 0 0.0213 0.1888 0 ...
#>  $ tfidf_essay0_always       : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_amp          : num  0.457 0.567 0 0 0 ...
#>  $ tfidf_essay0_anything     : num  0 0 0.108 0 0 ...
#>  $ tfidf_essay0_area         : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_around       : num  0 0 0.0327 0 0 ...
#>  $ tfidf_essay0_art          : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_back         : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_bay          : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_believe      : num  0 0 0 0 0.302 ...
#>  $ tfidf_essay0_big          : num  0.0747 0 0 0 0 ...
#>  $ tfidf_essay0_bit          : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_essay0_br           : num  0.0573 0.2665 0.0573 0 0 ...
#>  $ tfidf_essay0_can          : num  0.0406 0 0.0203 0 0 ...
#>  $ tfidf_essay0_city         : num  0 0 0 0 0 0 0 0 0 0 ...
#>   [list output truncated]

Type chart

textrecipes includes a small departure in design from recipes, in that it allows some inputs and outputs to be list columns. To avoid confusion, here is a table of the steps with their expected input and output types. Notice that you need to end with a numeric output for any future analysis to work.

Step Input Output Status
step_tokenize character list-column Done
step_untokenize list-column character Done
step_stem list-column list-column Done
step_stopwords list-column list-column Done
step_tokenfilter list-column list-column Done
step_tfidf list-column numeric Done
step_tf list-column numeric Done
step_texthash list-column numeric Done
step_word2vec character numeric TODO

(TODO = yet to be implemented, bug = currently not working, working = the step works but is not yet finished, i.e. missing documentation/tests/arguments, done = finished)

This means that valid sequences include:

recipe(~ ., data = data) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text) %>%
  step_tf(text)

# or

recipe(~ ., data = data) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_tfidf(text)
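
Another valid sequence, sketched here under the assumption of a data frame with a single character column named text (the column name and data are illustrative, not from the examples above), ends with feature hashing instead of TF-IDF:

```r
library(recipes)
library(textrecipes)

# Hypothetical data: one character column named `text`
data <- data.frame(text = c("some example text", "more example text"),
                   stringsAsFactors = FALSE)

recipe(~ ., data = data) %>%
  step_tokenize(text) %>%  # character -> list-column of tokens
  step_texthash(text)      # list-column -> numeric hash features
```

Per the type chart, step_texthash is a drop-in alternative to step_tf and step_tfidf whenever a numeric ending is needed.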