Commit

updated readme
trinker committed Dec 28, 2016
1 parent b2b6813 commit 9760b3a
Showing 3 changed files with 29 additions and 27 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -9,7 +9,7 @@ Description: A small collection of convenience tools for reading text
Depends: R (>= 3.2.2)
Suggests: testthat
Imports: curl, pdftools, readxl, textshape, tools, utils, XML
-Date: 2016-12-27
+Date: 2016-12-28
License: GPL-2
LazyData: TRUE
Roxygen: list(wrap = FALSE)
3 changes: 2 additions & 1 deletion README.Rmd
@@ -144,7 +144,8 @@ pdf_doc %>%
Users may find the following sites useful for OCR in R:

-- http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/
+- https://CRAN.R-project.org/package=tesseract
+- http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r
- https://github.com/soodoku/abbyyR


51 changes: 26 additions & 25 deletions README.md
@@ -16,7 +16,7 @@ Status](https://coveralls.io/repos/trinker/textreadr/badge.svg?branch=master)](h
**textreadr** is a small collection of convenience tools for reading
text documents into R. This is not meant to be an exhaustive collection;
for more see the
-[**tm**]( https://cran.r-project.org/package=tm) package.
+[**tm**](https://cran.r-project.org/web/packages/tm/index.html) package.


Table of Contents
@@ -158,7 +158,7 @@ Here I download a .docx file of presidential debated from 2012.
read_docx() %>%
head(3)

-## pres.deb1.docx read into C:\Users\Tyler\AppData\Local\Temp\RtmpSAlo9U
+## pres.deb1.docx read into C:\Users\Tyler\AppData\Local\Temp\RtmpeoTOjw

## [1] "LEHRER: We'll talk about -- specifically about health care in a moment. But what -- do you support the voucher system, Governor?"
## [2] "ROMNEY: What I support is no change for current retirees and near-retirees to Medicare. And the president supports taking $716 billion out of that program."
@@ -255,7 +255,8 @@ caution is useful for those struggling to read image text into R.
Users may find the following sites useful for OCR in R:

-- <http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/>
+- <https://CRAN.R-project.org/package=tesseract>
+- <http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r>
- <https://github.com/soodoku/abbyyR>

Read Transcripts
@@ -436,26 +437,26 @@ in **textreadr**'s system file:

levelName
pos
-¦--0_9.txt
-¦--1_7.txt
-¦--10_9.txt
-¦--11_9.txt
-¦--12_9.txt
-¦--13_7.txt
-¦--14_10.txt
-¦--15_7.txt
-¦--16_7.txt
-¦--17_9.txt
-¦--18_7.txt
-¦--19_10.txt
-¦--2_9.txt
-¦--3_10.txt
-¦--4_8.txt
-¦--5_10.txt
-¦--6_10.txt
-¦--7_7.txt
-¦--8_7.txt
-°--9_7.txt
+¦--0_9.txt
+¦--1_7.txt
+¦--10_9.txt
+¦--11_9.txt
+¦--12_9.txt
+¦--13_7.txt
+¦--14_10.txt
+¦--15_7.txt
+¦--16_7.txt
+¦--17_9.txt
+¦--18_7.txt
+¦--19_10.txt
+¦--2_9.txt
+¦--3_10.txt
+¦--4_8.txt
+¦--5_10.txt
+¦--6_10.txt
+¦--7_7.txt
+¦--8_7.txt
+°--9_7.txt

Here we have read the files in, one row per file.

@@ -519,7 +520,7 @@ I demonstrate pairings with
textshape::split_index(which(.$loc) -1) %>%
lapply(select, -loc)

-## SCDB_2012_01_codebook.pdf read into C:\Users\Tyler\AppData\Local\Temp\RtmpSAlo9U
+## SCDB_2012_01_codebook.pdf read into C:\Users\Tyler\AppData\Local\Temp\RtmpeoTOjw

## Function to extract cases
ex_vs <- qdapRegex::ex_(pattern = "((of|[A-Z][A-Za-z'.,-]+)\\s+)+([Vv]s?\\.\\s+)(([A-Z][A-Za-z'.,-]+\\s+)*((of|[A-Z][A-Za-z',.-]+),?($|\\s+|\\d))+)")
@@ -586,4 +587,4 @@ I demonstrate pairings with
## [3] "United States v. Havens"
## [4] "Parratt v. Taylor"
## [5] "Dougherty County Board of Education v. White"
-## [6] "Jenkins v. Anderson"
+## [6] "Jenkins v. Anderson"
