Permalink
Fetching contributors…
Cannot retrieve contributors at this time
255 lines (184 sloc) 9.65 KB
---
title: "Read text files with readtext()"
output:
rmarkdown::html_vignette:
css: mystyle.css
toc: yes
vignette: >
%\VignetteIndexEntry{Readtextfiles}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE,
comment = "##")
```
```{r eval=TRUE, message = FALSE}
# Load readtext package
library(readtext)
```
# 1. Introduction
The vignette walks you through importing a variety of different text files into R using the **readtext** package. Currently, **readtext** supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma-or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF and Microsoft Word formatted files (.pdf, .doc, .docx).
**readtext** also handles multiple files and file types using for instance a "glob" expression, files from a URL or an archive file (.zip, .tar, .tar.gz, .tar.bz). Usually, you do not have to determine the format of the files explicitly - **readtext** takes this information from the file ending.
The **readtext** package comes with a data directory called `extdata` that contains examples of all files listed above. In the vignette, we use this data directory.
```{r}
# Get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")
```
The `extdata` directory contains several subfolders that include different text files. In the following examples, we load one or more files stored in each of these folders. The `paste0` command is used to concatenate the `extdata` folder from the
**readtext** package with the subfolders. When reading in custom text files, you will need to determine your own data directory (see `?setwd()`).
# 2. Reading one or more text files
## 2.1 Plain text files (.txt)
The folder "txt" contains a subfolder named UDHR with .txt files of the Universal Declaration of Human Rights in 13 languages.
```{r}
# Read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
```
We can specify document-level metadata (`docvars`) based on the file names or on a separate data.frame. Below we take the docvars from the filenames (`docvarsfrom = "filenames"`) and set the names for each variable (`docvarnames = c("unit", "context", "year", "language", "party")`). The command `dvsep = "_"` determines the separator (a regular expression character string) included in the filenames to delimit the `docvar` elements.
```{r}
# Manifestos with docvars from filenames
readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
docvarsfrom = "filenames",
docvarnames = c("unit", "context", "year", "language", "party"),
dvsep = "_",
encoding = "ISO-8859-1")
```
**readtext** can also curse through subdirectories. In our example, the folder `txt/movie_reviews` contains two subfolders (called `neg` and `pos`). We can load all texts included in both folders.
```{r}
# Recurse through subdirectories
readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"))
```
## 2.2 Comma- or tab-separated values (.csv, .tab, .tsv)
Read in comma separated values (.csv files) that contain textual data. We determine the `texts` variable in our .csv file as the `text_field`. This is the column that contains the actual text. The other columns of the original csv file (`Year`, `President`, `FirstName`) are by default treated as document-level variables.
```{r}
# Read in comma-separated values
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
```
The same procedure applies to tab-separated values.
```{r}
# Read in tab-separated values
readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech")
```
## 2.3 JSON data (.json)
You can also read .json data. Again you need to specify the `text_field`.
```{r}
## Read in JSON data
readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts")
```
## 2.4 PDF files
**readtext** can also read in and convert .pdf files.
In the example below we load all .pdf files stored in the `UDHR` folder, and determine that the `docvars` shall be taken from the filenames. We call the document-level variables `document` and `language`, and specify the delimiter (`dvsep`).
```{r}
## Read in Universal Declaration of Human Rights pdf files
(rt_pdf <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
docvarsfrom = "filenames",
docvarnames = c("document", "language"),
sep = "_"))
```
## 2.5 Microsoft Word files (.doc, .docx)
Microsoft Word formatted files are converted through the package **antiword** for older `.doc` files, and using **XML** for newer `.docx` files.
```{r}
## Read in Word data (.docx)
readtext(paste0(DATA_DIR, "/word/*.docx"))
```
## 2.6 Text from URLs
You can also read in text directly from a URL.
```{r}
# Note: Example required: which URL should we use?
```
## 2.7 Text from archive files (.zip, .tar, .tar.gz, .tar.bz)
Finally, it is possible to include text from archives.
```{r}
# Note: Archive file required. The only zip archive included in readtext has
# different encodings and is difficult to import (see section 4.2).
```
# 3. Inter-operability with quanteda
**readtext** was originally developed in early versions of the [**quanteda**](http://github.com/quanteda/quanteda) package for the quantitative analysis of textual data. It was spawned from the `textfile()` function from that package, and now lives exclusively in **readtext**. Because **quanteda**'s corpus constructor recognizes the data.frame format returned by `readtext()`, it can construct a corpus directly from a `readtext` object, preserving all docvars and other meta-data.
```{r, message = FALSE}
require(quanteda)
```
You can easily construct a corpus from a **readtext** object.
```{r}
# read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
# create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
```
# 4. Solving common problems
## 4.1 Remove page numbers using regular expressions
When a document contains page numbers, they are imported as well. If you want to remove them, you can use a regular expression. We strongly recommend using the [**stringi**](http://github.com/gagolews/stringi) package. For the most common regular expressions you can look at this [cheatsheet](http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf).
You first need to check in the original file in which format the page numbers occur (e.g., "1", "-1-", "page 1" etc.). We can make use of the fact that page numbers are almost always preceded and followed by a linebreak (`\n`). After loading the text with **readtext**, you can replace the page numbers.
```{r, message = FALSE}
# Load stringi package
require(stringi)
```
In the first example, the page numbers have the format "page X".
```{r}
# Make some text with page numbers
sample_text_a <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus,
page 1
with the newspaper from a boy named quick Seamus, in his mouth.
page 2
The quicker brown fox jumped over 2 lazy dogs."
sample_text_a
# Remove "page" and respective digit
sample_text_a2 <- unlist(stri_split_fixed(sample_text_a, '\n'), use.names = FALSE)
sample_text_a2 <- stri_replace_all_regex(sample_text_a2, "page \\d*", "")
sample_text_a2 <- stri_trim_both(sample_text_a2)
sample_text_a2 <- sample_text_a2[sample_text_a2 != '']
stri_paste(sample_text_a2, collapse = '\n')
```
In the second example we remove page numbers which have the format "- X -".
```{r}
sample_text_b <- "The quick brown fox named Seamus
- 1 -
jumps over the lazy dog also named Seamus, with
- 2 -
the newspaper from a boy named quick Seamus, in his mouth.
- 33 -
The quicker brown fox jumped over 2 lazy dogs."
sample_text_b
sample_text_b2 <- unlist(stri_split_fixed(sample_text_b, '\n'), use.names = FALSE)
sample_text_b2 <- stri_replace_all_regex(sample_text_b2, "[-] \\d* [-]", "")
sample_text_b2 <- stri_trim_both(sample_text_b2)
sample_text_b2 <- sample_text_b2[sample_text_b2 != '']
stri_paste(sample_text_b2, collapse = '\n')
```
Such **stringi** functions can also be applied to **readtext** objects.
## 4.2 Read files with different encodings
Sometimes files of the same type have different encodings. If the encoding of a file is included in the file name, we can extract this information and import the texts correctly.
```{r}
# create a temporary directory to extract the .zip file
FILEDIR <- tempdir()
# unzip file
unzip(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"), exdir = FILEDIR)
```
Here, we will get the encoding from the filenames themselves.
```{r}
# get encoding from filename
filenames <- list.files(FILEDIR, "^(Indian|UDHR_).*\\.txt$")
head(filenames)
# Strip the extension
filenames <- gsub(".txt$", "", filenames)
parts <- strsplit(filenames, "_")
fileencodings <- sapply(parts, "[", 3)
head(fileencodings)
# Check whether certain file encodings are not supported
notAvailableIndex <- which(!(fileencodings %in% iconvlist()))
fileencodings[notAvailableIndex]
```
If we read the text files without specifying the encoding, we get erroneously formatted text. To avoid this, we determine the `encoding` using the character object `fileencoding` created above.
We can also add `docvars` based on the filenames.
```{r}
txts <- readtext(paste0(DATA_DIR, "/data_files_encodedtexts.zip"),
encoding = fileencodings,
docvarsfrom = "filenames",
docvarnames = c("document", "language", "input_encoding"))
print(txts, n = 50)
```
From this file we can easily create a **quanteda** `corpus` object.
```{r}
corpus_txts <- corpus(txts)
summary(corpus_txts, 5)
```