first commit
trinker committed Sep 9, 2015
0 parents commit 35ba77a
Showing 29 changed files with 628 additions and 0 deletions.
21 changes: 21 additions & 0 deletions .Rbuildignore
@@ -0,0 +1,21 @@
^.*\.Rproj$
^\.Rproj\.user$
^\.gitignore
NEWS.md
FAQ.md
NEWS.html
FAQ.html
^\.travis\.yml$
travis-tool.sh
inst/web
contributors.geojson
inst/build.R
^.*\.Rprofile$
README.Rmd
README.R
travis.yml
inst/gofastr_logo
inst/staticdocs
inst/extra_statdoc
inst/maintenance.R

10 changes: 10 additions & 0 deletions .gitignore
@@ -0,0 +1,10 @@
# History files
.Rhistory

# Example code in package build process
*-Ex.R

.Rprofile
.Rproj.user
gofastr.Rproj
inst/maintenance.R
26 changes: 26 additions & 0 deletions .travis.yml
@@ -0,0 +1,26 @@
language: c

sudo: required
before_install:
- curl -OL http://raw.github.com/craigcitro/r-travis/master/scripts/travis-tool.sh
- chmod 755 ./travis-tool.sh
- ./travis-tool.sh bootstrap
install:
- sh -e /etc/init.d/xvfb start
- ./travis-tool.sh aptget_install r-cran-xml
- ./travis-tool.sh install_github hadley/devtools
- ./travis-tool.sh install_deps
- ./travis-tool.sh github_package jimhester/covr
script: ./travis-tool.sh run_tests
after_success:
- Rscript -e 'library(covr);coveralls()'
notifications:
  email:
    on_success: change
    on_failure: change
env:
  global:
    - R_BUILD_ARGS="--resave-data=best"
    - R_CHECK_ARGS="--as-cran"
    - DISPLAY=:99.0
    - BOOTSTRAP_LATEX=1
13 changes: 13 additions & 0 deletions DESCRIPTION
@@ -0,0 +1,13 @@
Package: gofastr
Title: Fast DocumentTermMatrix and TermDocumentMatrix Creation
Version: 0.0.1
Authors@R: c(person("Tyler", "Rinker", email = "tyler.rinker@gmail.com", role = c("aut", "cre")))
Maintainer: Tyler Rinker <tyler.rinker@gmail.com>
Description: Harness the power of 'data.table' and 'stringi' to quickly generate 'tm' DocumentTermMatrix and TermDocumentMatrix data structures.
Depends: R (>= 3.2.2)
Suggests: testthat
Imports: data.table (>= 1.9.5), slam, stringi, tm
Date: 2015-09-08
License: GPL-2
LazyData: TRUE
Roxygen: list(wrap = FALSE)
8 changes: 8 additions & 0 deletions NAMESPACE
@@ -0,0 +1,8 @@
# Generated by roxygen2 (4.1.1): do not edit by hand

S3method(remove_stopwords,DocumentTermMatrix)
S3method(remove_stopwords,TermDocumentMatrix)
export(q_dtm)
export(q_tdm)
export(remove_stopwords)
importFrom(data.table,":=")
23 changes: 23 additions & 0 deletions NEWS
@@ -0,0 +1,23 @@
NEWS
====

Versioning
----------

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

* Breaking backward compatibility bumps the major (and resets the minor
  and patch)
* New additions without breaking backward compatibility bump the minor
  (and reset the patch)
* Bug fixes and misc changes bump the patch


gofastr 0.0.1
----------------------------------------------------------------

This package is...
30 changes: 30 additions & 0 deletions R/gofastr-package.R
@@ -0,0 +1,30 @@
#' Fast DocumentTermMatrix and TermDocumentMatrix Creation
#'
#' This package does one thing: it harnesses the power of \pkg{data.table} and
#' \pkg{stringi} to quickly generate \pkg{tm} \code{\link[tm]{TermDocumentMatrix}}
#' and \code{\link[tm]{DocumentTermMatrix}} data structures without creating a
#' \code{\link[tm]{Corpus}} first.
#' @docType package
#' @name gofastr
#' @aliases gofastr package-gofastr
NULL

#' 2012 U.S. Presidential Debates
#'
#' A dataset containing a cleaned version of all three presidential debates for
#' the 2012 election.
#'
#' @details
#' \itemize{
#' \item person. The speaker
#' \item tot. Turn of talk
#' \item dialogue. The words spoken
#' \item time. Variable indicating which of the three debates the dialogue is from
#' }
#'
#' @docType data
#' @keywords datasets
#' @name presidential_debates_2012
#' @usage data(presidential_debates_2012)
#' @format A data frame with 2912 rows and 4 variables
NULL
27 changes: 27 additions & 0 deletions R/q_dtm.R
@@ -0,0 +1,27 @@
#' Quick DocumentTermMatrix
#'
#' Make a \code{\link[tm]{DocumentTermMatrix}} from a vector of text and an
#' optional vector of documents.
#'
#' @param text A vector of strings.
#' @param docs An optional vector of document names.
#' @param weighting A \pkg{tm} weighting: \code{\link[tm]{weightTf}},
#' \code{\link[tm]{weightTfIdf}}, \code{\link[tm]{weightBin}}, or
#' \code{\link[tm]{weightSMART}}.
#' @param \ldots Additional arguments passed to \code{\link[tm]{as.DocumentTermMatrix}}.
#' @return Returns a \code{\link[tm]{DocumentTermMatrix}}.
#' @keywords dtm DocumentTermMatrix
#' @export
#' @importFrom data.table :=
#' @examples
#' with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_")))
q_dtm <- function(text, docs = seq_along(text), weighting = tm::weightTf, ...){
    . <- x <- y <- NULL
    dat <- data.table::data.table(y = stringi::stri_trans_tolower(text), x = docs)[,
        y := stringi::stri_extract_all_words(y)][, .(y = unlist(y)), by = x][!is.na(y),]
    out <- suppressMessages(data.table::dcast(dat, x ~ y, fun.aggregate = length, drop = FALSE, fill = 0))
    out2 <- as.matrix(out[, -1, with = FALSE])
    row.names(out2) <- out[[1]]
    tm::as.DocumentTermMatrix(slam::as.simple_triplet_matrix(out2), weighting = weighting, ...)
}

27 changes: 27 additions & 0 deletions R/q_tdm.R
@@ -0,0 +1,27 @@
#' Quick TermDocumentMatrix
#'
#' Make a \code{\link[tm]{TermDocumentMatrix}} from a vector of text and an
#' optional vector of documents.
#'
#' @param text A vector of strings.
#' @param docs An optional vector of document names.
#' @param weighting A \pkg{tm} weighting: \code{\link[tm]{weightTf}},
#' \code{\link[tm]{weightTfIdf}}, \code{\link[tm]{weightBin}}, or
#' \code{\link[tm]{weightSMART}}.
#' @param \ldots Additional arguments passed to \code{\link[tm]{as.TermDocumentMatrix}}.
#' @return Returns a \code{\link[tm]{TermDocumentMatrix}}.
#' @keywords tdm TermDocumentMatrix
#' @importFrom data.table :=
#' @export
#' @examples
#' with(presidential_debates_2012, q_tdm(dialogue, paste(time, tot, sep = "_")))
q_tdm <- function(text, docs = seq_along(text), weighting = tm::weightTf, ...){
    . <- x <- y <- NULL
    dat <- data.table::data.table(y = stringi::stri_trans_tolower(text), x = docs)[,
        y := stringi::stri_extract_all_words(y)][, .(y = unlist(y)), by = x][!is.na(y),]
    out <- suppressMessages(data.table::dcast(dat, y ~ x, fun.aggregate = length, drop = FALSE, fill = 0))
    out2 <- as.matrix(out[, -1, with = FALSE])
    row.names(out2) <- out[[1]]
    tm::as.TermDocumentMatrix(slam::as.simple_triplet_matrix(out2), weighting = weighting, ...)
}

47 changes: 47 additions & 0 deletions R/remove_stopwords.R
@@ -0,0 +1,47 @@
#' Remove Stopwords from a TermDocumentMatrix/DocumentTermMatrix
#'
#' Remove stopwords and words shorter than \code{min.char} characters from a \code{\link[tm]{TermDocumentMatrix}}
#' or \code{\link[tm]{DocumentTermMatrix}}.
#'
#' @param x A \code{\link[tm]{TermDocumentMatrix}} or \code{\link[tm]{DocumentTermMatrix}}.
#' @param stopwords A vector of stopwords to remove.
#' @param min.char The minimum number of characters a word must have to be retained.
#' @return Returns a \code{\link[tm]{TermDocumentMatrix}} or \code{\link[tm]{DocumentTermMatrix}}.
#' @keywords stopwords
#' @export
#' @examples
#' (x <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_"))))
#' remove_stopwords(x)
#' (y <- with(presidential_debates_2012, q_tdm(dialogue, paste(time, tot, sep = "_"))))
#' remove_stopwords(y)
remove_stopwords <- function(x, stopwords = tm::stopwords("english"), min.char = 3) {
    UseMethod("remove_stopwords")
}

#' @export
#' @method remove_stopwords TermDocumentMatrix
remove_stopwords.TermDocumentMatrix <- function(x, stopwords = tm::stopwords("english"), min.char = 3) {

    if (!is.null(stopwords)){
        x <- x[!rownames(x) %in% stopwords, ]
    }
    if (!is.null(min.char)){
        x <- x[nchar(rownames(x)) > min.char - 1, ]
    }
    x
}

#' @export
#' @method remove_stopwords DocumentTermMatrix
remove_stopwords.DocumentTermMatrix <- function(x, stopwords = tm::stopwords("english"), min.char = 3) {

    if (!is.null(stopwords)){
        x <- x[, !colnames(x) %in% stopwords]
    }
    if (!is.null(min.char)){
        x <- x[, nchar(colnames(x)) > min.char - 1]
    }
    x
}


Empty file added R/utils.R
Empty file.
64 changes: 64 additions & 0 deletions README.Rmd
@@ -0,0 +1,64 @@
---
title: "gofastr"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
md_document:
toc: true
---

```{r, echo=FALSE}
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a>', ver, ver)
```

[![Build Status](https://travis-ci.org/trinker/gofastr.svg?branch=master)](https://travis-ci.org/trinker/gofastr)
[![Coverage Status](https://coveralls.io/repos/trinker/gofastr/badge.svg?branch=master)](https://coveralls.io/r/trinker/gofastr?branch=master)
`r verbadge`

<img src="inst/gofastr_logo/r_gofastr.png" width="150" alt="gofastr Logo">


**gofastr** is designed to do one thing really well: it harnesses the power of **data.table** and **stringi** to quickly generate **tm** `DocumentTermMatrix` and `TermDocumentMatrix` data structures.

In my work I often get data in the form of large .csv files. Building a `Corpus` first is an unnecessary intermediate step that adds run time. **gofastr** skips this step and uses **data.table** and **stringi** to quickly make the `DocumentTermMatrix` and `TermDocumentMatrix` data structures directly.
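
The idea of going straight from strings to counts can be sketched roughly as follows. This is a minimal illustration on a hypothetical two-document input, not the package's exact code: the names `text`, `docs`, `long`, `counts`, and `wide` are assumptions made up for the example, and only **data.table** and **stringi** calls are used.

```r
library(data.table)
library(stringi)

# Toy input (hypothetical): two short "documents" and their names.
text <- c("The cat sat.", "The dog sat on the cat!")
docs <- c("doc1", "doc2")

# Lower-case the text and split it into words; the result is a list
# with one character vector of words per document.
words <- stri_extract_all_words(stri_trans_tolower(text))

# Long form: one row per (document, word) pair. Rows with NA terms
# (texts that contain no words) are dropped.
long <- data.table(
    doc  = rep(docs, lengths(words)),
    term = unlist(words)
)[!is.na(term)]

# Count each term within each document.
counts <- long[, .N, by = .(doc, term)]

# Wide form: one row per document, one column per term, zeros where
# a term does not occur in a document.
wide <- dcast(counts, doc ~ term, value.var = "N", fill = 0)
wide
```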

There are three functions:

| Function | Description |
|--------------------|-------------------------------------------------------|
| `q_tdm` | `TermDocumentMatrix` from string vector |
| `q_dtm` | `DocumentTermMatrix` from string vector |
| `remove_stopwords` | Remove stopwords and words shorter than a minimum character length from `TermDocumentMatrix`/`DocumentTermMatrix` |


# Installation

To download the development version of **gofastr**:

Download the [zip ball](https://github.com/trinker/gofastr/zipball/master) or [tar ball](https://github.com/trinker/gofastr/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/gofastr")
```

# Contact

You are welcome to:
* submit suggestions and bug reports at: <https://github.com/trinker/gofastr/issues>
* send a pull request on: <https://github.com/trinker/gofastr/>
* compose a friendly e-mail to: <tyler.rinker@gmail.com>


# Examples

```{r}
library(gofastr)
(x <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_"))))
remove_stopwords(x)
(y <- with(presidential_debates_2012, q_tdm(dialogue, paste(time, tot, sep = "_"))))
remove_stopwords(y)
```
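The filtering that `remove_stopwords` performs can be pictured on an ordinary count matrix. The sketch below is a base R stand-in under stated assumptions, not the package code: the matrix `m`, the two-word stop list, and the name `keep` are hypothetical, and a plain matrix takes the place of a **tm** `DocumentTermMatrix`. Terms are kept only if they are not stopwords and have at least `min.char` characters.

```r
# Toy document-by-term count matrix (hypothetical data).
m <- matrix(
    c(2, 1, 0, 1,
      1, 0, 3, 1),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("doc1", "doc2"), c("the", "cat", "on", "dog"))
)

stopwords <- c("the", "on")  # assumed stop list for this sketch
min.char  <- 3               # keep terms with at least 3 characters

# Logical mask over the term columns: not a stopword and long enough.
keep <- !(colnames(m) %in% stopwords) & (nchar(colnames(m)) >= min.char)

# drop = FALSE keeps the result a matrix even if one column survives.
filtered <- m[, keep, drop = FALSE]
colnames(filtered)
```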
