# Community Area Population from a pdf File

by Luc Anselin (anselin@uchicago.edu) (8/22/2016)

This document pulls the Chicago Community Area 2010 population data from a pdf file, available
at http://www.cityofchicago.org/city/en/depts/dcd/supp_info/community_area_2000and2010censuspopulationcomparisons.html.
The link leads to a pdf file that contains a table with the neighborhood ID, the neighborhood name, the populations
for 2010 and 2000, the difference between the two years, and the percentage difference.


Note: this is written with R beginners in mind; more seasoned R users can probably skip most of the comments.

For more extensive details about each function, see the R (or RStudio) help files.

Packages used:

- **pdftools**

### Extracting the content from a pdf file

A pdf file is difficult to handle as a source of data, since it doesn't contain tags like an html file.
We will use the **pdftools** package, which turns the contents of a pdf file into a set of long character strings,
one for each page of the document (strictly speaking, a character vector rather than a list). This package is not
installed by default, so you may have to run **install.packages("pdftools")** first.

The resulting data structure is somewhat complex and not necessarily easy to parse. However, in our case, the table has such a simple structure that we can extract the population values by doing some sleuthing on which columns
contain those values. This will illustrate the power of the various parsing and text extraction functions available in R.

We start by turning the pdf into a list of text strings, and then organize that list so that it only contains the table entries for the 77 community area neighborhoods.



#### Reading the pdf file

We use the **pdf_text** function from the **pdftools** package to turn the pdf file into a set of character strings, one
for each page.

In [1]:
library(pdftools)
dat <- pdf_text("Census_2010_and_2000_CA_Populations.pdf")
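Before calling **pdf_text**, it can help to confirm that the file is where R expects it. The sketch below assumes the pdf sits in the current working directory. The helper name `read_pdf_pages` is hypothetical and is not part of **pdftools**.

```r
# Hypothetical helper (not part of pdftools): fail early with a clear
# message when the file is missing, then hand off to pdf_text().
read_pdf_pages <- function(path) {
  if (!file.exists(path)) {
    stop("pdf file not found: ", path,
         " (working directory is ", getwd(), ")")
  }
  pdftools::pdf_text(path)
}

# Usage, assuming the file name from the text:
# dat <- read_pdf_pages("Census_2010_and_2000_CA_Populations.pdf")
```

If the file is missing, the error message reports the working directory, which is usually the first thing to check.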

We check the contents of the **dat** object.

In [4]:
dat

The **dat** object has two entries, one for each page. Each entry is a single string. So, when you check the length of the first entry, it may be surprising that its **length** is only 1. That is because **length** counts the elements of a vector, not the characters in a string: as far as R is concerned, the whole page is one long string with no further structure.

In [5]:
length(dat[[1]])
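The difference between the number of elements and the number of characters can be illustrated with a toy string. The `page` value below is made up and merely stands in for one page of the **pdf_text** output.

```r
# Toy stand-in for one page of pdf_text() output: a single string
# with embedded newline characters.
page <- "first line\nsecond line\nthird line"

length(page)   # number of elements in the vector
nchar(page)    # number of characters in the string

# Splitting on the newline reveals the individual lines
page_lines <- strsplit(page, split = "\n")[[1]]
length(page_lines)
```

Here **length** reports 1 for the unsplit string, while **nchar** counts every character, including the two newlines.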

#### Turning each line in the file into a list

We can carry this out one step at a time, but in order to reach some level of abstraction, we turn it
into a loop. First, we initialize the neighborhood list (**nnlist**) with an empty character string [first line below].
Next comes the loop for values of the index **i** going from 1 to 2 (recall that **dat** has only two elements, one
for each page). Since each element is just one long string, we use the **strsplit** string split command to separate
the long string into one string for each line, using the newline character **\n** as the separator
[line 3 in the code snippet below]. Because **strsplit** returns a list, we then extract its first (and only) element
using the double bracket notation (if this seems strange, check the R intro document). We subsequently strip the first four lines from the result. These lines do not contain data; of course, the only way we know this is by carefully checking
the structure of the pdf file.

To make sure the result is a simple vector rather than a list, we pass it through **unlist** (after the double
bracket extraction it is already a character vector, so this step acts as a safeguard). This then allows us to
concatenate the result to the current **nnlist** (initially just an empty character string; after the first pass it
contains the empty string and the lines of the first page, and at the end it has the empty string and the lines of
the first and second pages).

In [6]:
nnlist <- ""
for (i in 1:2) {
  ppage <- strsplit(dat[[i]],split="\n")
  nni <- ppage[[1]]
  nni <- nni[-(1:4)]
  nnu <- unlist(nni)
  nnlist <- c(nnlist,nnu)
}
length(nnlist)

The resulting vector has 79 elements. We still need to strip the first (empty) element and the last
element, which contains nothing but the totals. We thus extract the elements from **2** to **length - 1**.

In [7]:
nnlist <- nnlist[2:(length(nnlist)-1)]
length(nnlist)

The resulting vector consists of 77 elements that are each a string corresponding to a line in the table.

In [8]:
nnlist[1:3]
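The splitting and trimming steps above can also be sketched without an explicit loop. The example below is a variation on the approach in the text, not a replacement for it. The `pages` vector is made-up toy input that mimics the layout described (four header lines per page, and a totals line at the end).

```r
# Made-up two-page input mimicking pdf_text() output: four header
# lines per page, then data lines, with a totals line at the very end.
pages <- c("h1\nh2\nh3\nh4\n1 Rogers Park\n2 West Ridge",
           "h1\nh2\nh3\nh4\n3 Uptown\nTotal")

# strsplit() is vectorized: one character vector of lines per page
by_page <- strsplit(pages, split = "\n")

# Drop the first four lines of every page and flatten into one vector
all_lines <- unlist(lapply(by_page, function(p) p[-(1:4)]))

# No empty placeholder was used, so only the totals line has to go;
# a negative index removes the last element.
areas <- all_lines[-length(all_lines)]
areas
```

Because no empty string is used to start the vector, only the final totals line needs to be removed.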

### Extracting the population values

We first initialize a vector of zeros to hold the population values. When the size of a vector is known in advance,
it is preferable to initialize it first rather than to let it grow by appending elements one at a time.
We use the **vector** command, set the **mode** to **numeric**, and give the **length** as the
length of **nnlist**.

In [9]:
nnpop <- vector(mode="numeric",length=length(nnlist))
nnpop
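As a small illustration of pre-allocation, the sketch below fills a vector by position and compares it with one that is grown step by step. The values (squares of 1 to 5) are arbitrary and only serve as an example.

```r
n <- 5

# Two equivalent ways to pre-allocate a numeric vector of zeros
v1 <- vector(mode = "numeric", length = n)
v2 <- numeric(n)
identical(v1, v2)

# Pre-allocated: fill by position
squares <- numeric(n)
for (i in seq_len(n)) squares[i] <- i^2

# Growing: works, but R may copy the vector each time it is extended
grown <- c()
for (i in seq_len(n)) grown <- c(grown, i^2)

identical(squares, grown)
```

Both approaches give the same values; the difference only starts to matter for long vectors.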

We will again use a loop to process each element of **nnlist** (each line of the table) one by one.
We use the **substr** command to extract the characters between positions 27 and 39 (these values
were determined after taking a careful look at the structure of the table). However, there is still a problem, since
the population values contain commas. We therefore do two things in one line of code. First, we use **gsub**
to replace each **,** with an empty string **""**. Second, we turn the result into a numeric value by
means of **as.numeric**. We then assign this number to position **i** of the vector. The resulting
vector **nnpop** contains the population for each of the community areas.

In [10]:
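for (i in seq_along(nnlist)) {
     # characters 27-39 of each line hold the 2010 population
     popchar <- substr(nnlist[i],start=27,stop=39)
     # drop the thousands separators, then convert to a number
     popval <- as.numeric(gsub(",","",popchar))
     nnpop[i] <- popval
}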
nnpop
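Since **substr**, **gsub** and **as.numeric** all operate on whole vectors, the same extraction can be sketched without a loop. The `toy` lines below are made up: their fixed-width layout (population right-aligned in columns 27 to 39) is an assumption chosen to match the positions used above, and the actual pdf text may be spaced differently.

```r
# Made-up fixed-width lines: ID in columns 1-3, name in columns 4-26,
# and the population right-aligned in columns 27-39. This layout is an
# assumption chosen to match the positions used in the text.
toy <- sprintf("%-3s%-23s%13s",
               c("1", "2", "3"),
               c("Rogers Park", "West Ridge", "Uptown"),
               c("54,991", "71,942", "56,362"))

# substr(), gsub() and as.numeric() all accept whole vectors,
# so no explicit loop is required.
popchar <- substr(toy, start = 27, stop = 39)
pop <- as.numeric(gsub(",", "", popchar, fixed = TRUE))
pop
```

The leading blanks left in `popchar` are harmless, since **as.numeric** ignores surrounding whitespace.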

### Creating a data frame

In addition to the vector of the population values, we also need a vector of ID values. Since the community
area indicators are simple sequence numbers, we create such a vector to serve as the ID.

In [11]:
nnid <- (1:length(nnlist))
nnid
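A related way to build such a sequence is **seq_along**, shown here on a toy vector. The two forms agree whenever the vector has at least one element; they differ only when it is empty.

```r
x <- c("a", "b", "c")

ids <- seq_along(x)
ids
is.integer(ids)       # sequences built this way are already integers

# The two forms differ only for an empty vector:
empty <- character(0)
1:length(empty)       # counts down from 1 to 0, giving two values
seq_along(empty)      # zero-length result
```

For the 77 community areas either form gives the same result; **seq_along** is simply the safer habit when the input could be empty.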

We turn the vectors **nnid** and **nnpop** into a data frame using the **data.frame** command.
As we did before, we make sure the ID variable is an integer (for merging in GeoDa) by means of **as.integer( )**.
Since the variable names assigned automatically are not that informative, we then set them to
**NID** and **POP2010** using the **names** command.

In [12]:
neighpop <- data.frame(as.integer(nnid),nnpop)
neighpop

as.integer.nnid.,nnpop
1,54991
2,71942
3,56362
4,39493
5,31867
6,94368
7,64116
8,80484
9,11187
10,37023


In [13]:
names(neighpop) <- c("NID","POP2010")
neighpop

NID,POP2010
1,54991
2,71942
3,56362
4,39493
5,31867
6,94368
7,64116
8,80484
9,11187
10,37023
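The two steps above (creating the data frame, then renaming its columns) can also be combined by naming the arguments to **data.frame**. The sketch below uses the first three values from the output above as toy input; the object names `ids`, `pops` and `toy_df` are chosen only for this illustration.

```r
# First three values from the output above, used as toy input
ids <- 1:3
pops <- c(54991, 71942, 56362)

# Naming the arguments sets the column names at creation time
toy_df <- data.frame(NID = as.integer(ids), POP2010 = pops)
names(toy_df)
str(toy_df)
```

The result has the same structure as **neighpop** after the **names** command.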


### Create a csv output file

We write the contents of the data frame to a csv file. As before, we use the **row.names=FALSE** option to
avoid the extraneous first column in the output file.

In [14]:
write.csv(neighpop,"Community_Pop.csv",row.names=FALSE)
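A quick way to check the output is to read the file back in. The sketch below does this with a small toy data frame and a temporary file, so that nothing in the working directory is touched; the object names are chosen only for this illustration.

```r
# Toy data frame standing in for neighpop
toy_df <- data.frame(NID = 1:3, POP2010 = c(54991, 71942, 56362))

# Write to a temporary file so nothing in the working directory is touched
out <- tempfile(fileext = ".csv")
write.csv(toy_df, out, row.names = FALSE)

# Inspect the raw text, then read it back
readLines(out)
back <- read.csv(out)
names(back)
```

With **row.names=FALSE**, the file contains only the two named columns, and reading it back returns the same variable names.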