# Enumerations: Chapter 1: counting punctuation (notebook version)
This R notebook is an adaptation of code published by Andrew Piper in the [GitHub repo accompanying his 2018 book](https://github.com/piperandrew/enumerations), *Enumerations: Data and Literary Study*.

It was adapted by Quinn Dombrowski, who doesn't know any R, using similar data. Andrew Piper extensively commented the original code, but Quinn restructured it as markdown cells in this notebook, and created the paratextual structure, comments on the source data, and other comments geared towards people who don't really know R.

The functionality of the notebook was made possible by Shawn Graham, who diagnosed some of the errors Quinn encountered (and complained about on Twitter) and found solutions.

## 1. Install and load packages
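The first code cell below installs four of the prerequisite R packages. The second code cell loads all the packages the notebook uses. Note that the second cell also loads `stringr`, `zoo`, `ggplot2`, and `scales`, which the first cell does not install; if `library()` reports any of them as missing, install them the same way.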

In [None]:
install.packages('tm', repos='http://cran.us.r-project.org')
install.packages('SnowballC', repos='http://cran.us.r-project.org')
install.packages('splitstackshape', repos='http://cran.us.r-project.org')
install.packages('gridExtra', repos='http://cran.us.r-project.org')

In [None]:
library(stringr)
library(tm)
library(SnowballC)
library(zoo)
library(splitstackshape)
library(ggplot2)
library(scales)
library(gridExtra)
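
If you are unsure which of these packages are already installed, the following is a minimal sketch for checking before you run the cells above. The variable names `needed` and `missing.packages` are illustrative and not part of the original notebook, and the install call is left as a comment so that nothing is installed unless you choose to run it.

```r
# Packages loaded by the notebook's second code cell
needed <- c("stringr", "tm", "SnowballC", "zoo",
            "splitstackshape", "ggplot2", "scales", "gridExtra")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed (may be empty)
missing.packages <- needed[!installed]
print(missing.packages)

# To install whatever is missing, uncomment the next line:
# install.packages(missing.packages, repos = "http://cran.us.r-project.org")
```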

## 2. Prepare data
This code assumes that you have a directory full of .txt files that represent a corpus of poetry. The file name for each .txt file should be as follows: *NumericalUniqueID_AuthorLastNameAuthorFirstName_PoemTitle_PublishedYear.txt*. 

For example: `66005_AliAghaShahid_Ghazal_1949.txt`.
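
The analysis below does not itself split the file name apart, but if you want the ID, author, title, and year as separate values later, a sketch along these lines may help. `parse_poem_filename` is a hypothetical helper, not something from the original code, and it assumes that only the title might contain extra underscores.

```r
# Hypothetical helper: split a file name of the form
# NumericalUniqueID_AuthorLastNameAuthorFirstName_PoemTitle_PublishedYear.txt
parse_poem_filename <- function(path) {
  stem <- sub("\\.txt$", "", basename(path))      # drop directory and extension
  parts <- strsplit(stem, "_", fixed = TRUE)[[1]] # split on underscores
  n <- length(parts)
  if (n < 4) stop("Unexpected file name: ", path)
  data.frame(
    id = parts[1],
    author = parts[2],
    # assumption: any extra underscores belong to the title
    title = paste(parts[3:(n - 1)], collapse = "_"),
    year = as.integer(parts[n]),
    stringsAsFactors = FALSE
  )
}

meta <- parse_poem_filename("66005_AliAghaShahid_Ghazal_1949.txt")
print(meta)
```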

Each .txt file in the intended source material has the line number on the line above the text, and the text is indented by a tab. For instance:

`1
	There once was a man from Nantucket
2
	Who kept all his cash in a bucket.
3
	But his daughter, named Nan,
4
	Ran away with a man
5
	And as for the bucket, Nantucket.`
    
You don't necessarily have to structure your text this way, but if you don't, there are some lines below (e.g. `no.lines<-(length(work2)/2)-1` and `work.word.vector<-gsub("\\d", "", work.word.vector) #remove numbers`) that you'd need to modify to accommodate the lack of line numbers.
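
To see how this layout looks once R has read it, here is a small sketch that writes the sample limerick to a temporary file and reads it back line by line, the same way the analysis code does. The names `poem.file`, `is.number.line`, and `verse.lines` are illustrative only. Counting the lines that are not bare numbers is one possible alternative to the formula used in the notebook, not the original author's method.

```r
# Write the sample poem in the numbered, tab-indented layout described above
poem.file <- tempfile(fileext = ".txt")
writeLines(c("1", "\tThere once was a man from Nantucket",
             "2", "\tWho kept all his cash in a bucket.",
             "3", "\tBut his daughter, named Nan,",
             "4", "\tRan away with a man",
             "5", "\tAnd as for the bucket, Nantucket."), poem.file)

# Read one element per line of the file, as the notebook does for work2
work2 <- scan(poem.file, what = "character", quote = "", sep = "\n", quiet = TRUE)
length(work2)  # line numbers and verse lines together

# Assumption: a line containing only digits (and whitespace) is a line number
is.number.line <- grepl("^[[:space:]]*[0-9]+[[:space:]]*$", work2)
verse.lines <- work2[!is.number.line]
no.lines <- length(verse.lines)
print(no.lines)
```

For a file with no line numbers at all, `length(work2)` on its own would give the number of non-blank lines. For this five-line sample, the notebook's formula `(length(work2)/2)-1` evaluates to 4, so it is worth checking the line count against a few of your own files before relying on the per-line figures.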

## 3. Counting punctuation marks

This section counts punctuation marks by word and by line (if poems have line breaks), for either a group of works or a single work that has been divided into smaller chunks. It does not record all punctuation marks, but limits itself to a small set of primary marks:

- These include: `?`, `!`, `.`, `,`, `;`, `:`, `(`, `)`

It takes as input a directory of works and outputs a table of marks per word and per line.
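
As a rough illustration of what "marks per word" means, the sketch below pulls the first primary mark out of each whitespace-separated token of a made-up sentence and divides by the number of tokens. It uses base R rather than `stringr`, the pattern `mark.pattern` is an assumption based on the list above, and the token count is a simplification of the word count used in the analysis code.

```r
# A made-up sentence, split on whitespace into tokens
sample.text <- "But his daughter, named Nan, ran away (with a man)! And the bucket..."
tokens <- strsplit(sample.text, "[[:space:]]+")[[1]]

# Assumed pattern: an ellipsis (three periods) or one of the primary marks
mark.pattern <- "\\.\\.\\.|[?!.,;:()]"

# regexpr() finds only the first match in each token;
# regmatches() drops tokens that contain no mark
marks <- regmatches(tokens, regexpr(mark.pattern, tokens))
print(table(marks))

# Marks per token for this sentence
marks.per.word <- length(marks) / length(tokens)
print(marks.per.word)
```

Because only the first mark in each token is kept, a token such as `man)!` contributes one mark rather than two.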

This is the code used to extract punctuation from the following data sets (as defined by Andrew Piper):

- [txtLAB450](https://txtlab.org/2016/01/txtlab450-a-data-set-of-multilingual-novels-for-teaching-and-research/)
- Novel_19C: 3,285 novels in English published between 1800 and 1899, collected by the [Stanford Literary Lab](https://litlab.stanford.edu/)
- Novel_20C: a random selection of 24,400 pairs of pages from novels published since 1950.
- Poetry_19C: 125,675 poems in English written by authors born between 1765 and 1865, drawn from the ProQuest Literature Online Collection.
- Poetry_20C: 75,297 poems in English written by authors born between 1865 and 1975, drawn from the ProQuest Literature Online Collection.

The code is presented for use on other data sets.

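For non-poetry, turn off (comment out) the lines related to "line breaks", such as those that use `work2`, `no.lines`, and the variables ending in `.line`.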

### 3.1 Load data
Below, put the full path to the directory that contains your corpus of texts, in place of `/Users/qad/20CPoetryAll`.

For instance, the path to a Texts folder in the default Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

- On Mac: `'/Users/YOUR-USER-NAME/Documents/Texts'`
- On Windows: `'C:\\Users\\YOUR-USER-NAME\\Documents\\Texts'` (in R, each backslash in a path has to be doubled)

*Note:* The code cell below is modified from the source repo. There were issues opening the files when `full.names` was set to `FALSE`. Thanks to Shawn Graham for figuring out the problem.
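
To see what `full.names` changes, the sketch below creates a throwaway directory with one made-up poem file and lists it both ways. The directory and file name are invented for the demonstration. With `full.names=FALSE`, `list.files` returns bare file names, which R then looks for in the current working directory rather than in the corpus directory.

```r
# A throwaway corpus directory containing one made-up poem file
demo.dir <- tempfile("demo_corpus")
dir.create(demo.dir)
writeLines(c("1", "\tA line of verse."),
           file.path(demo.dir, "1_DoeJane_Example_1950.txt"))

# Bare file names: only usable if the working directory is demo.dir
bare <- list.files(demo.dir, pattern = "\\.txt$", full.names = FALSE)
print(bare)

# Full paths: usable from any working directory
full <- list.files(demo.dir, pattern = "\\.txt$", full.names = TRUE)
print(file.exists(full))
```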

In [None]:
#pattern is a regular expression, so "\\.txt$" matches names that end in .txt
filenames<-list.files("/Users/qad/20CPoetryAll", pattern="\\.txt$", full.names=TRUE)

### 3.2 Analyze data
The following cell block analyzes the data, according to the comments embedded in the code.

In [None]:
punctuation.dtm<-NULL
for (i in seq_along(filenames)) { #seq_along skips the loop cleanly if no files were found
  #load poem
  work<-scan(filenames[i], what="character", quote="")
  #load second version separating by line breaks
  work2<-scan(filenames[i], what="character", quote="", sep = "\n")
  if (length(work) > 0){
    no.lines<-(length(work2)/2)-1
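    #find the primary marks; the ellipsis is written "\\.\\.\\." so that it matches three literal periods
    #("\\..." would match a period followed by any two characters)
    punct<-grep("\\.\\.\\.|\\!|\\.|\\,|\\;|\\:|\\(|\\)", work)
    work.punct<-str_extract(work, "\\.\\.\\.|\\!|\\.|\\,|\\;|\\:|\\(|\\)")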
    punct2<-work.punct[!is.na(work.punct)]
    #calculate total words
    work.lower<-tolower(work) # all lower case
    work.words<-strsplit(work.lower, "\\W") # turn into a list of words
    work.word.vector<-unlist(work.words) #turn into a vector
    work.word.vector<-gsub("\\d", "", work.word.vector) #remove numbers
    work.word.vector<-work.word.vector[which(work.word.vector!="")]#only keeps parts of vector with words
    total.words<-length(work.word.vector) #total words in the work
    #frequency of a given punctuation mark
    ellipsis<-length(grep("\\.\\.\\.", punct2))/total.words
    ellipsis.line<-length(grep("\\.\\.\\.", punct2))/no.lines
    exclam<-length(grep("\\!", punct2))/total.words
    exclam.line<-length(grep("\\!", punct2))/no.lines
    period<-length(grep("\\.", punct2))/total.words
    period.line<-length(grep("\\.", punct2))/no.lines
    comma<-length(grep("\\,", punct2))/total.words
    comma.line<-length(grep("\\,", punct2))/no.lines
    total<-length(punct)/total.words
    total.line<-length(punct)/no.lines
    #novel.dtm<-data.frame(filenames[i], total.words,ellipsis, exclam, period, comma, total)
    novel.dtm<-data.frame(filenames[i], total.words, no.lines, ellipsis, ellipsis.line, exclam, exclam.line, period, period.line, comma, comma.line, total, total.line)
    punctuation.dtm<-rbind(punctuation.dtm, novel.dtm)
  }
}
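
If you'd like to check the word-count steps on their own, the sketch below runs them on the first two lines of the sample poem, typed in by hand as the kind of vector that `scan` would return (one element per whitespace-separated token, including the line numbers). The sample vector is made up for illustration; the steps themselves are the same as in the loop above.

```r
# Tokens as scan() would return them for the first two numbered lines
work <- c("1", "There", "once", "was", "a", "man", "from", "Nantucket",
          "2", "Who", "kept", "all", "his", "cash", "in", "a", "bucket.")

work.lower <- tolower(work)                           # all lower case
work.words <- strsplit(work.lower, "\\W")             # split on non-word characters
work.word.vector <- unlist(work.words)                # flatten into one vector
work.word.vector <- gsub("\\d", "", work.word.vector) # remove digits (the line numbers)
work.word.vector <- work.word.vector[work.word.vector != ""] # drop empty strings
total.words <- length(work.word.vector)
print(total.words)
```

The line numbers `1` and `2` become empty strings once their digits are removed, so they drop out of the count.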

### 3.3 Output results
The following cell block creates an output file with the results of the cell above, named `punctuationoutput.csv`. This file will, by default, be created in the same directory as the notebook.

In [None]:
write.csv(punctuation.dtm, file = "punctuationoutput.csv",row.names=FALSE)
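
When you read the output back into R, note that the first column gets the automatically generated name `filenames.i.`, because the data frame was built from the expression `filenames[i]`. The sketch below reproduces this with a one-row stand-in table written to a temporary file, then renames the column. The values and the new column name `filename` are made up for the demonstration.

```r
# One-row stand-in for punctuation.dtm, built the same way as in the loop
filenames <- c("1_DoeJane_Example_1950.txt")  # hypothetical file name
i <- 1
demo.dtm <- data.frame(filenames[i], total.words = 15, comma = 0.1)
print(names(demo.dtm))

# Write it out and read it back, as you would with punctuationoutput.csv
demo.csv <- tempfile(fileext = ".csv")
write.csv(demo.dtm, file = demo.csv, row.names = FALSE)
results <- read.csv(demo.csv, stringsAsFactors = FALSE)

# Give the first column a friendlier name
names(results)[1] <- "filename"
print(results)
```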