Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
0 parents
commit 35ba77a
Showing 29 changed files with 628 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
^.*\.Rproj$ | ||
^\.Rproj\.user$ | ||
^\.gitignore | ||
NEWS.md | ||
FAQ.md | ||
NEWS.html | ||
FAQ.html | ||
^\.travis\.yml$ | ||
travis-tool.sh | ||
inst/web | ||
contributors.geojson | ||
inst/build.R | ||
^.*\.Rprofile$ | ||
README.Rmd | ||
README.R | ||
travis.yml | ||
inst/gofastr_logo | ||
inst/staticdocs | ||
inst/extra_statdoc | ||
inst/maintenance.R | ||
|
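The entries in the ignore file above are patterns rather than literal paths: R treats each line of `.Rbuildignore` as a Perl-style regular expression matched case-insensitively against paths relative to the package root. A rough way to see what a pattern would exclude is to try it with `grepl()`. The file paths below are hypothetical and only illustrate the matching; this is a sketch, not how `R CMD build` is implemented.

```r
# Hypothetical file paths, relative to the package root.
paths <- c("gofastr.Rproj", ".Rproj.user", "NEWS.md", "inst/NEWS.md.bak", "R/q_dtm.R")

# Three of the patterns from the ignore file above.
patterns <- c("^.*\\.Rproj$", "^\\.Rproj\\.user$", "NEWS.md")

# TRUE where at least one pattern matches the path
# (Perl-style regex, case-insensitive).
ignored <- vapply(
  paths,
  function(p) {
    any(vapply(patterns, grepl, logical(1), x = p, perl = TRUE, ignore.case = TRUE))
  },
  logical(1)
)
print(ignored)
```

Unanchored entries such as `NEWS.md` match anywhere in a path, and the unescaped `.` matches any character, which is why the hypothetical `inst/NEWS.md.bak` is also caught. Anchoring with `^` and `$` and escaping dots is the stricter form.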
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
# History files | ||
.Rhistory | ||
|
||
# Example code in package build process | ||
*-Ex.R | ||
|
||
.Rprofile | ||
.Rproj.user | ||
gofastr.Rproj | ||
inst/maintenance.R |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
language: c | ||
|
||
sudo: required | ||
before_install: | ||
- curl -OL http://raw.github.com/craigcitro/r-travis/master/scripts/travis-tool.sh | ||
- chmod 755 ./travis-tool.sh | ||
- ./travis-tool.sh bootstrap | ||
install: | ||
- sh -e /etc/init.d/xvfb start | ||
- ./travis-tool.sh aptget_install r-cran-xml | ||
- ./travis-tool.sh install_github hadley/devtools | ||
- ./travis-tool.sh install_deps | ||
- ./travis-tool.sh github_package jimhester/covr | ||
script: ./travis-tool.sh run_tests | ||
after_success: | ||
- Rscript -e 'library(covr);coveralls()' | ||
notifications: | ||
email: | ||
on_success: change | ||
on_failure: change | ||
env: | ||
global: | ||
- R_BUILD_ARGS="--resave-data=best" | ||
- R_CHECK_ARGS="--as-cran" | ||
- DISPLAY=:99.0 | ||
- BOOTSTRAP_LATEX=1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
Package: gofastr | ||
Title: Fast DocumentTermMatrix and TermDocumentMatrix Creation | ||
Version: 0.0.1 | ||
Authors@R: c(person("Tyler", "Rinker", email = "tyler.rinker@gmail.com", role = c("aut", "cre"))) | ||
Maintainer: Tyler Rinker <tyler.rinker@gmail.com> | ||
Description: Harness the power of 'data.table' and 'stringi' to quickly generate 'tm' DocumentTermMatrix and TermDocumentMatrix data structures. | ||
Depends: R (>= 3.2.2) | ||
Suggests: testthat | ||
Imports: data.table (>= 1.9.5), slam, stringi, tm | ||
Date: 2015-09-08 | ||
License: GPL-2 | ||
LazyData: TRUE | ||
Roxygen: list(wrap = FALSE) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
# Generated by roxygen2 (4.1.1): do not edit by hand | ||
|
||
S3method(remove_stopwords,DocumentTermMatrix) | ||
S3method(remove_stopwords,TermDocumentMatrix) | ||
export(q_dtm) | ||
export(q_tdm) | ||
export(remove_stopwords) | ||
importFrom(data.table,":=") |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
NEWS | ||
==== | ||
|
||
Versioning | ||
---------- | ||
|
||
Releases will be numbered with the following semantic versioning format: | ||
|
||
<major>.<minor>.<patch> | ||
|
||
And constructed with the following guidelines: | ||
|
||
* Breaking backward compatibility bumps the major (and resets the minor | ||
and patch) | ||
* New additions without breaking backward compatibility bump the minor | ||
(and reset the patch) | ||
* Bug fixes and misc changes bump the patch | ||
|
||
|
||
gofastr 0.0.1 | ||
---------------------------------------------------------------- | ||
|
||
This package is... |
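The versioning rules in the NEWS file above amount to simple arithmetic on a `<major>.<minor>.<patch>` string. As a minimal sketch, assuming a hypothetical helper named `bump_version` (it is not part of gofastr), the rules could be written as:

```r
# Hypothetical helper (not part of gofastr): bump one component of a
# <major>.<minor>.<patch> version string and reset the lower components.
bump_version <- function(version, part = c("major", "minor", "patch")) {
  part <- match.arg(part)
  v <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  stopifnot(length(v) == 3, !anyNA(v))
  idx <- match(part, c("major", "minor", "patch"))
  v[idx] <- v[idx] + 1L
  # Reset every component to the right of the one that was bumped.
  if (idx < 3) v[(idx + 1):3] <- 0L
  paste(v, collapse = ".")
}

bump_version("0.0.1", "patch")  # "0.0.2"
bump_version("0.0.1", "minor")  # "0.1.0"
bump_version("0.4.2", "major")  # "1.0.0"
```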
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
#' Fast DocumentTermMatrix and TermDocumentMatrix Creation | ||
#' | ||
#' This package does one thing...It harnesses the power of \pkg{data.table} and | ||
#' \pkg{stringi} to quickly generate \pkg{tm} \code{\link[tm]{TermDocumentMatrix}} | ||
#' and \code{\link[tm]{DocumentTermMatrix}} data structures without creating a | ||
#' \code{\link[tm]{Corpus}} first. | ||
#' @docType package | ||
#' @name gofastr | ||
#' @aliases gofastr package-gofastr | ||
NULL | ||
|
||
#' 2012 U.S. Presidential Debates | ||
#' | ||
#' A dataset containing a cleaned version of all three presidential debates for | ||
#' the 2012 election. | ||
#' | ||
#' @details | ||
#' \itemize{ | ||
#' \item person. The speaker | ||
#' \item tot. Turn of talk | ||
#' \item dialogue. The words spoken | ||
#' \item time. Variable indicating which of the three debates the dialogue is from | ||
#' } | ||
#' | ||
#' @docType data | ||
#' @keywords datasets | ||
#' @name presidential_debates_2012 | ||
#' @usage data(presidential_debates_2012) | ||
#' @format A data frame with 2912 rows and 4 variables | ||
NULL |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
#' Quick DocumentTermMatrix | ||
#' | ||
#' Make a \code{\link[tm]{DocumentTermMatrix}} from a vector of text and an | ||
#' optional vector of documents. | ||
#' | ||
#' @param text A vector of strings. | ||
#' @param docs An optional vector of document names. | ||
#' @param weighting A \pkg{tm} weighting: \code{\link[tm]{weightTf}}, | ||
#' \code{\link[tm]{weightTfIdf}}, \code{\link[tm]{weightBin}}, or | ||
#' \code{\link[tm]{weightSMART}}. | ||
#' @param \ldots Additional arguments passed to \code{\link[tm]{as.DocumentTermMatrix}}. | ||
#' @return Returns a \code{\link[tm]{DocumentTermMatrix}}. | ||
#' @keywords dtm DocumentTermMatrix | ||
#' @export | ||
#' @importFrom data.table := | ||
#' @examples | ||
#' with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_"))) | ||
q_dtm <- function(text, docs = seq_along(text), weighting = tm::weightTf, ...){ | ||
. <- x <- y <- NULL | ||
dat <- data.table::data.table(y = stringi::stri_trans_tolower(text), x = docs)[, | ||
y := stringi::stri_extract_all_words(y)][, .(y = unlist(y)), by = x][!is.na(y),] | ||
out <- suppressMessages(data.table::dcast(dat, x ~ y, fun.aggregate = length, drop = FALSE, fill = 0)) | ||
out2 <- as.matrix(out[, -1, with = FALSE]) | ||
row.names(out2) <- out[[1]] | ||
tm::as.DocumentTermMatrix(slam::as.simple_triplet_matrix(out2), weighting = weighting, ...) | ||
} | ||
|
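The pipeline in `q_dtm` above (lowercase, split into words, reshape to one row per document and term, then cast to counts) can be hard to follow from the chained `data.table` calls alone. The following is a base-R sketch of the same idea on toy data. It is an illustration under stated assumptions, not the package's implementation: the input strings are hypothetical, `strsplit()` is a crude stand-in for `stringi::stri_extract_all_words`, and the result is a plain count matrix rather than a **tm** object.

```r
# Toy input (hypothetical): two documents.
text <- c("The cat sat.", "The dog sat; the dog ran!")
docs <- c("doc1", "doc2")

# Lowercase, then split on runs of non-letters (a crude stand-in for
# stringi::stri_extract_all_words); drop any empty tokens.
tokens <- strsplit(tolower(text), "[^a-z]+")
tokens <- lapply(tokens, function(w) w[nzchar(w)])

# Long format: one row per (document, term) occurrence.
long <- data.frame(
  doc  = rep(docs, lengths(tokens)),
  term = unlist(tokens),
  stringsAsFactors = FALSE
)

# Cross-tabulate into a documents-by-terms count matrix.
dtm <- unclass(table(long$doc, long$term))
print(dtm)
```

Transposing the result with `t(dtm)` gives the terms-by-documents orientation, which corresponds to swapping the two sides of the casting formula as `q_tdm` does in this commit.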
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
#' Quick TermDocumentMatrix | ||
#' | ||
#' Make a \code{\link[tm]{TermDocumentMatrix}} from a vector of text and an | ||
#' optional vector of documents. | ||
#' | ||
#' @param text A vector of strings. | ||
#' @param docs An optional vector of document names. | ||
#' @param weighting A \pkg{tm} weighting: \code{\link[tm]{weightTf}}, | ||
#' \code{\link[tm]{weightTfIdf}}, \code{\link[tm]{weightBin}}, or | ||
#' \code{\link[tm]{weightSMART}}. | ||
#' @param \ldots Additional arguments passed to \code{\link[tm]{as.TermDocumentMatrix}}. | ||
#' @return Returns a \code{\link[tm]{TermDocumentMatrix}}. | ||
#' @keywords tdm TermDocumentMatrix | ||
#' @importFrom data.table := | ||
#' @export | ||
#' @examples | ||
#' with(presidential_debates_2012, q_tdm(dialogue, paste(time, tot, sep = "_"))) | ||
q_tdm <- function(text, docs = seq_along(text), weighting = tm::weightTf, ...){ | ||
. <- x <- y <- NULL | ||
dat <- data.table::data.table(y = stringi::stri_trans_tolower(text), x = docs)[, | ||
y := stringi::stri_extract_all_words(y)][, .(y = unlist(y)), by = x][!is.na(y),] | ||
out <- suppressMessages(data.table::dcast(dat, y ~ x, fun.aggregate = length, drop = FALSE, fill = 0)) | ||
out2 <- as.matrix(out[, -1, with = FALSE]) | ||
row.names(out2) <- out[[1]] | ||
tm::as.TermDocumentMatrix(slam::as.simple_triplet_matrix(out2), weighting = weighting, ...) | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
#' Remove Stopwords from a TermDocumentMatrix/DocumentTermMatrix | ||
#' | ||
#' Remove stopwords and words shorter than \code{min.char} characters from a \code{\link[tm]{TermDocumentMatrix}} | ||
#' or \code{\link[tm]{DocumentTermMatrix}}. | ||
#' | ||
#' @param x A \code{\link[tm]{TermDocumentMatrix}} or \code{\link[tm]{DocumentTermMatrix}}. | ||
#' @param stopwords A vector of stopwords to remove. | ||
#' @param min.char The minimum number of characters a word must have to be retained. | ||
#' @return Returns a \code{\link[tm]{TermDocumentMatrix}} or \code{\link[tm]{DocumentTermMatrix}}. | ||
#' @keywords stopwords | ||
#' @export | ||
#' @examples | ||
#' (x <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_")))) | ||
#' remove_stopwords(x) | ||
#' (y <- with(presidential_debates_2012, q_tdm(dialogue, paste(time, tot, sep = "_")))) | ||
#' remove_stopwords(y) | ||
remove_stopwords <- function(x, stopwords = tm::stopwords("english"), min.char = 3) { | ||
UseMethod("remove_stopwords") | ||
} | ||
|
||
#' @export | ||
#' @method remove_stopwords TermDocumentMatrix | ||
remove_stopwords.TermDocumentMatrix <- function(x, stopwords = tm::stopwords("english"), min.char = 3) { | ||
|
||
if (!is.null(stopwords)){ | ||
x <- x[!rownames(x) %in% stopwords, ] | ||
} | ||
if (!is.null(min.char)){ | ||
x <- x[nchar(rownames(x)) > min.char - 1, ] | ||
} | ||
x | ||
} | ||
|
||
#' @export | ||
#' @method remove_stopwords DocumentTermMatrix | ||
remove_stopwords.DocumentTermMatrix <- function(x, stopwords = tm::stopwords("english"), min.char = 3) { | ||
|
||
if (!is.null(stopwords)){ | ||
x <- x[, !colnames(x) %in% stopwords] | ||
} | ||
if (!is.null(min.char)){ | ||
x <- x[, nchar(colnames(x)) > min.char - 1] | ||
} | ||
x | ||
} | ||
|
||
|
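The two `remove_stopwords` methods above differ only in whether terms sit on the rows or the columns. As a minimal sketch of the filtering logic, assuming a plain base-R count matrix with hypothetical data in place of a **tm** object:

```r
# Toy documents-by-terms count matrix (hypothetical data).
m <- matrix(
  c(1, 0, 2, 1,
    0, 3, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("the", "of", "economy", "tax"))
)

stopwords <- c("the", "of")  # stand-in for tm::stopwords("english")
min.char  <- 4               # illustrative value; the diff defaults to 3

# Keep columns that are not stopwords and have at least min.char characters.
keep <- !colnames(m) %in% stopwords & nchar(colnames(m)) >= min.char
filtered <- m[, keep, drop = FALSE]
print(filtered)
```

For whole-number lengths, `nchar(...) >= min.char` is the same test as the `nchar(...) > min.char - 1` used in the diff. The `drop = FALSE` is needed here only because a plain matrix would otherwise collapse to a vector when a single column survives.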
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
--- | ||
title: "gofastr" | ||
date: "`r format(Sys.time(), '%d %B, %Y')`" | ||
output: | ||
md_document: | ||
toc: true | ||
--- | ||
|
||
```{r, echo=FALSE} | ||
desc <- suppressWarnings(readLines("DESCRIPTION")) | ||
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)" | ||
loc <- grep(regex, desc) | ||
ver <- gsub(regex, "\\2", desc[loc]) | ||
verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a>', ver, ver) | ||
``` | ||
|
||
[![Build Status](https://travis-ci.org/trinker/gofastr.svg?branch=master)](https://travis-ci.org/trinker/gofastr) | ||
[![Coverage Status](https://coveralls.io/repos/trinker/gofastr/badge.svg?branch=master)](https://coveralls.io/r/trinker/gofastr?branch=master) | ||
`r verbadge` | ||
|
||
<img src="inst/gofastr_logo/r_gofastr.png" width="150" alt="gofastr Logo"> | ||
|
||
|
||
**gofastr** is designed to do one thing really well...It harnesses the power of **data.table** and **stringi** to quickly generate **tm** `DocumentTermMatrix` and `TermDocumentMatrix` data structures. | ||
|
||
In my work I often get data in the form of large .csv files. The `Corpus` structure is an unnecessary step that requires additional run time. **gofastr** skips this step and uses **data.table** and **stringi** to quickly make the `DocumentTermMatrix` and `TermDocumentMatrix` data structures directly. | ||
|
||
There are three functions: | ||
|
||
| Function | Description | | ||
|--------------------|-------------------------------------------------------| | ||
| `q_tdm` | `TermDocumentMatrix` from string vector | | ||
| `q_dtm` | `DocumentTermMatrix` from string vector | | ||
| `remove_stopwords` | Remove stopwords and words below a minimum character length from `TermDocumentMatrix`/`DocumentTermMatrix` | | ||
|
||
|
||
# Installation | ||
|
||
To download the development version of **gofastr**: | ||
|
||
Download the [zip ball](https://github.com/trinker/gofastr/zipball/master) or [tar ball](https://github.com/trinker/gofastr/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version: | ||
|
||
```r | ||
if (!require("pacman")) install.packages("pacman") | ||
pacman::p_load_gh("trinker/gofastr") | ||
``` | ||
|
||
# Contact | ||
|
||
You are welcome to: | ||
* submit suggestions and bug-reports at: <https://github.com/trinker/gofastr/issues> | ||
* send a pull request on: <https://github.com/trinker/gofastr/> | ||
* compose a friendly e-mail to: <tyler.rinker@gmail.com> | ||
|
||
|
||
# Examples | ||
|
||
```{r} | ||
library(gofastr) | ||
(x <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_")))) | ||
remove_stopwords(x) | ||
(y <- with(presidential_debates_2012, q_tdm(dialogue, paste(time, tot, sep = "_")))) | ||
remove_stopwords(y) | ||
``` |