Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
173 lines (130 sloc) 6.38 KB
output
pdf_document
cat(paste("(C) (cc by-sa) Wouter van Atteveldt, file generated", format(Sys.Date(), format="%B %d %Y")))

Note on the data used in this howto: This data can be downloaded from http://piketty.pse.ens.fr/files/capital21c/en/xls/, but the excel format is a bit difficult to parse at it is meant to be human readable, with multiple header rows etc. For that reason, I've extracted csv files for some interesting tables that I've uploaded to https://github.com/vanatteveldt/learningr/tree/master/data. If you're accessing this tutorial from the githup project, these files should be in your 'data' sub folder automatically.

Merging and transforming data in R

Let's have a closer look at the relation at the accumulation of capital in various countries. In chapter 5, separate values are introduced for private and public capital:

private.capital = read.csv("data/private_capital.csv")
public.capital = read.csv("data/public_capital.csv")

Combining data using cbind

First, let's combine the two data sets into a single set. This is done using cbind, which binds multiple columns (vectors or data frames) into a single data frame. It only works if the data is alinged correctly and has the same amount of rows, so first we check this:

nrow(private.capital)
nrow(public.capital)
table(private.capital$Year == public.capital$Year)

So they both have 41 rows and in all rows the Year variables are identical. Thus, we can use cbind here, omitting the first column of public.capital.

capital = cbind(private.capital, public.capital[-1])
head(capital)

Renaming variables

As could be seen in the head output, the column names are identical for both data sets, which causes a problem. So, let's prepend the last 10 column names with public. using the paste function, which pastes two texts together.

colnames(capital)[11:19] = paste("public", colnames(capital[11:19]), sep=".")
head(capital)

/Note that using merge rather than cbind would be more natural and would prevent these problems, but it is good to know how cbind and renaming works in any case/

In order to rename variables more easily in a 'manual' fashion, the rename command from the plyr package can be used.

Note, if you get an error about missing a package, you can install it using the install.packages function (or you can use the 'Packages' panel in the bottom right in RStudio):

install.packages("plyr")  # only needed the first time

Now, we can use library(plyr) to load the package, and use the rename command:

library(plyr)
capital = rename(capital, c("U.K."="UK", "U.S."="US", "public.U.K."="public.UK", "public.U.S."="public.US"))
head(capital)

Variables to cases using melt

Now assume that we want to calculate the total capital (private plus public) per country. One way to do that would be to simply sum the relevant parts of the data frame:

total.capital = capital[, 2:10] + capital[, 11:19]
head(total.capital)

However, a more general way that will afford more flexibility later is to first transform both datasets from wide into long formats (e.g. convert the variables to cases), and then merge the two sets. The reshape2 package has a function melt that transforms variables to cases based on a list of index variables that identify the cases.

library(reshape2)
public.long = melt(public.capital, id.vars="Year")
private.long = melt(private.capital, id.vars="Year")
head(public.long)

As you can see, we now have one row per year per country. We can now use merge to merge the two data frames. By default, merge merges data on identical column names. Since the value column is also shared between the data frames, and we don't want to merge on that column, we need to specify the by= argument:

capital = merge(public.long, private.long, by=c("Year", "variable"))
head(capital)

So now we have our data ready. Let's rename the column names to something more sensible, and then compute the total capital. We also save the file for later use using the save command.

colnames(capital) = c("Year", "Country", "Public", "Private")
capital$Total = capital$Private + capital$Public
save(capital, file="data/capital.rdata")

Variables to cases

The counterparg of melting is called casting. The dcast function works by using a 'casting formula' rows ~ columns to specify which variables to place in the rows and columns of the resulting data frame.

d = dcast(capital, Year ~ Country, value.var="Total")
head(d)

So now we have a data frame that we can plot, e.g. using a for loop as above. cast can also work to make aggregate functions. Suppose we would want to make a plot per decade rather than per year, we first make a new 'decade' value:

capital$Decade = floor(capital$Year / 10) * 10

A good way to check whether a recode such as this succeeded is to tabulate decade and year:

table(capital$Decade, capital$Year)

Now, we use the new Decade variable in the casting formula, and specify an aggregation function mean:

d = dcast(capital, Decade ~ Country, value.var="Total", fun.aggregate=mean)
head(d)

Aggregate vs. cast

Instead of cast, we can also use the normal'aggregatecommand. Cast and aggregate have slightly different applications. Aggregate allows multiple columns to be aggregated in one command (but all using the same aggregation function), but produces a 'long' data format, whilecast` can only aggregate a single variable, but can directly create a tabular data format. For example, the following aggregates the public, private and total capital per decade and country:

aggregated = aggregate(capital[c("Public", "Private", "Total")], by=capital[c("Decade", "Country")], FUN=mean)
head(aggregated)

Note: the by= argument in aggregate needs to be a list or data frame. Since capital[, "Decade"] returns a vector rather than a data frame, you must use either capital["Decade"] (ommitting the comma) or capital[, "Decade", drop=F]. For example, the first attempt below to aggregate the recent values per country fails because of this problem:

aggregated = aggregate(capital["Total"], by=capital[, "Country"], FUN=mean)
aggregated = aggregate(capital["Total"], by=capital[, "Country", drop=F], FUN=mean)
head(aggregated)