tif: Text Interchange Formats


This package describes and validates formats for storing common objects arising in text analysis as native R objects. Representations of a text corpus, a document term matrix, and tokenized text are included. The tokenized text format is extensible to include other annotations. There are two versions of the corpus and tokens objects; packages should accept both and return or coerce to at least one of these.

Installation

You can install the development version using devtools:

devtools::install_github("ropensci/tif")

Usage

The package can be used to check that a particular object is in a valid format. For example, here we see that the object corpus is a valid corpus data frame:

library(tif)
corpus <- data.frame(doc_id = c("doc1", "doc2", "doc3"),
                     text = c("Aujourd'hui, maman est morte.",
                              "It was a pleasure to burn.",
                              "All this happened, more or less."),
                     stringsAsFactors = FALSE)

tif_is_corpus_df(corpus)
TRUE

The package also has functions to convert between the two formats of the corpus objects (character vector and data frame) and of the tokens objects (list and data frame). For example:

tif_as_corpus_character(corpus)
                              doc1                               doc2 
   "Aujourd'hui, maman est morte."       "It was a pleasure to burn." 
                              doc3 
"All this happened, more or less." 

Note that extra metadata columns are lost in the conversion from a data frame to a named character vector.
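To make the note above concrete, here is a minimal sketch that mimics the data frame to character vector conversion in base R. It approximates the behavior for illustration and is not the package's own implementation; the `language` column is a made-up metadata column.

```r
# A corpus data frame with one extra document-level metadata column.
# The `language` column is a hypothetical example of metadata.
corpus_meta <- data.frame(doc_id = c("doc1", "doc2"),
                          text = c("Aujourd'hui, maman est morte.",
                                   "It was a pleasure to burn."),
                          language = c("fr", "en"),
                          stringsAsFactors = FALSE)

# Base R approximation of the conversion: only doc_id and text survive.
corpus_chr <- setNames(corpus_meta$text, corpus_meta$doc_id)

names(corpus_chr)              # the document ids become the names
names(attributes(corpus_chr))  # only "names"; the language column is gone

# With tif installed, the package's converter can be used instead.
if (requireNamespace("tif", quietly = TRUE)) {
  print(tif::tif_as_corpus_character(corpus_meta))
}
```

If the metadata matters downstream, it is probably simplest to keep the data frame version and convert only at the point where a character vector is required.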

Details

This package describes and validates formats for storing common objects arising in text analysis as native R objects. Representations of a text corpus, a document term matrix, and tokenized text are included. The tokenized text format is extensible to include other annotations. There are two versions of the corpus and tokens objects; packages should accept and return at least one of these.

corpus (data frame) - A valid corpus data frame object is a data frame with at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding. Document ids must be unique. The second column is called text and must also be a character vector in UTF-8 encoding. Each individual document is represented by a single row in the data frame. Additional document-level metadata columns and corpus-level attributes are allowed but not required.
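The rules for a corpus data frame can be sketched as a handful of base R checks. The helper `check_corpus_df` below is hypothetical and written only to illustrate the definition; it is not the package's validator, and it uses `validUTF8()` as a stand-in for the encoding requirement.

```r
# Hypothetical helper: a base R sketch of the corpus data frame rules.
check_corpus_df <- function(x) {
  is.data.frame(x) &&
    ncol(x) >= 2 &&
    identical(names(x)[1:2], c("doc_id", "text")) &&  # fixed column order
    is.character(x$doc_id) &&
    is.character(x$text) &&
    anyDuplicated(x$doc_id) == 0L &&                  # unique document ids
    all(validUTF8(x$doc_id)) &&
    all(validUTF8(x$text))
}

# One row per document; `year` is a made-up metadata column.
good <- data.frame(doc_id = c("doc1", "doc2"),
                   text = c("It was a pleasure to burn.",
                            "All this happened, more or less."),
                   year = c(1L, 2L),
                   stringsAsFactors = FALSE)
attr(good, "source") <- "example"  # corpus-level attributes are allowed

# Same data, but with a repeated document id.
bad <- good
bad$doc_id <- c("doc1", "doc1")

check_corpus_df(good)
check_corpus_df(bad)

# With tif installed, the package's validator can be called directly.
if (requireNamespace("tif", quietly = TRUE)) {
  print(tif::tif_is_corpus_df(good))
}
```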

corpus (character vector) - A valid character vector corpus object is a character vector with UTF-8 encoding. If it has names, the names should be unique and also in UTF-8 encoding. No other attributes should be present.
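As a rough illustration of the character vector rules, the sketch below checks the type, the optional names, and the absence of other attributes in base R. The helper `check_corpus_character` is hypothetical and is not the package's validator.

```r
# Hypothetical helper: a base R sketch of the character vector corpus rules.
check_corpus_character <- function(x) {
  nms <- names(x)
  is.character(x) &&
    all(validUTF8(x)) &&
    # names are optional, but must be unique and valid UTF-8 when present
    (is.null(nms) || (anyDuplicated(nms) == 0L && all(validUTF8(nms)))) &&
    # no attributes other than `names` are allowed
    all(names(attributes(x)) %in% "names")
}

named <- c(doc1 = "It was a pleasure to burn.",
           doc2 = "All this happened, more or less.")
unnamed <- unname(named)

# An extra attribute makes the object fail the sketch above.
tagged <- named
attr(tagged, "language") <- "en"

check_corpus_character(named)
check_corpus_character(unnamed)
check_corpus_character(tagged)

# With tif installed, the package's validator can be called directly.
if (requireNamespace("tif", quietly = TRUE)) {
  print(tif::tif_is_corpus_character(named))
}
```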

dtm - A valid document term matrix is a sparse matrix with rows representing documents and columns representing terms. The row names are a character vector giving the document ids, with no duplicated entries. The column names are a character vector giving the terms of the matrix, with no duplicated entries. The sparse matrix should inherit from the dgCMatrix class of the Matrix package.
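A small document term matrix matching this description can be built with `Matrix::sparseMatrix()`. This is a sketch only: the terms and counts are made up for illustration.

```r
library(Matrix)

# Three documents by three terms; entries are (made-up) term counts.
dtm <- sparseMatrix(i = c(1, 1, 2, 3),   # row (document) indices
                    j = c(1, 2, 2, 3),   # column (term) indices
                    x = c(2, 1, 1, 4),   # numeric counts
                    dims = c(3, 3),
                    dimnames = list(c("doc1", "doc2", "doc3"),
                                    c("burn", "pleasure", "happened")))

inherits(dtm, "dgCMatrix")           # sparse column-compressed numeric matrix
anyDuplicated(rownames(dtm)) == 0L   # document ids are unique
anyDuplicated(colnames(dtm)) == 0L   # terms are unique

# With tif installed, the package's validator can be called directly.
if (requireNamespace("tif", quietly = TRUE)) {
  print(tif::tif_is_dtm(dtm))
}
```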

tokens (data frame) - A valid data frame tokens object is a data frame with at least two columns. There must be a column called doc_id that is a character vector with UTF-8 encoding. Document ids must uniquely identify documents; because each token has its own row, an id is repeated on every row belonging to the same document. There must also be a column called token that is likewise a character vector in UTF-8 encoding. Each individual token is represented by a single row in the data frame. Additional token-level metadata columns are allowed but not required.
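A tokens data frame in this format might look like the sketch below, with one row per token and the document id repeated on each of that document's rows. The `token_id` column is a hypothetical token-level annotation added for illustration.

```r
# One row per token; doc_id repeats for tokens of the same document.
tokens <- data.frame(doc_id = c("doc1", "doc1", "doc2", "doc2", "doc2"),
                     token = c("It", "was", "All", "this", "happened"),
                     token_id = c(1L, 2L, 1L, 2L, 3L),  # hypothetical extra column
                     stringsAsFactors = FALSE)

# Base R checks mirroring the description above.
all(c("doc_id", "token") %in% names(tokens))
is.character(tokens$doc_id) && is.character(tokens$token)
all(validUTF8(tokens$doc_id)) && all(validUTF8(tokens$token))

# Number of tokens per document.
table(tokens$doc_id)

# With tif installed, the package's validator can be called directly.
if (requireNamespace("tif", quietly = TRUE)) {
  print(tif::tif_is_tokens_df(tokens))
}
```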

tokens (list) - A valid list tokens object is a (possibly named) list of character vectors. The character vectors, as well as the names, should be in UTF-8 encoding. No other attributes should be present in either the list or any of its elements.
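The relationship between the list and data frame versions of a tokens object can be sketched in base R as follows. This approximates the idea behind the conversion and is not the package's own implementation; note that `split()` orders the result by sorted document id, which may differ from the original order in general.

```r
# A named list of character vectors, one element per document.
tokens_list <- list(doc1 = c("It", "was"),
                    doc2 = c("All", "this", "happened"))

# Base R checks mirroring the description above.
is.list(tokens_list)
all(vapply(tokens_list, is.character, logical(1)))

# List to data frame: repeat each document id once per token.
tokens_df <- data.frame(doc_id = rep(names(tokens_list), lengths(tokens_list)),
                        token = unlist(tokens_list, use.names = FALSE),
                        stringsAsFactors = FALSE)

# Data frame back to list: group the tokens by document id.
tokens_back <- split(tokens_df$token, tokens_df$doc_id)

# With tif installed, the package's functions can be used instead.
if (requireNamespace("tif", quietly = TRUE)) {
  print(tif::tif_is_tokens_list(tokens_list))
  print(tif::tif_as_tokens_df(tokens_list))
}
```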
