3_organizing.Rmd

---
output: pdf_document
---
```{r, echo=FALSE}
cat(paste("(C) (cc by-sa) Wouter van Atteveldt, file generated", format(Sys.Date(), format="%B %d %Y")))
head = function(...) knitr::kable(utils::head(...))
```

> Note on the data used in this howto: 
> This data can be downloaded from http://piketty.pse.ens.fr/files/capital21c/en/xls/, 
> but the excel format is a bit difficult to parse at it is meant to be human readable, with multiple header rows etc. 
> For that reason, I've extracted csv files for some interesting tables that I've uploaded to 
> https://github.com/vanatteveldt/learningr/tree/master/data.
> If you're accessing this tutorial from the githup project, these files should be in your 'data' sub folder automatically.

Organizing data in R
====================

This hands-on demonstrates reading, writing, and manipulating data in R.
As before, we will continue using the data from Piketty's 'Capital in the 21st Century' 


```{r}
income = read.csv("data/income_toppercentile.csv")
```


Saving and loading data
-----

So far, we've used the `read.csv` command to read data from a CSV file.
As can be guessed, there is also a `write.csv` command that writes data into a CSV file:

```{r}
write.csv(income, file="test.csv")
test = read.csv("test.csv")
head(test)
```

A new column was created because by default `write.csv` also writes the row numbers
(you can check this by opening test.csv in excel).
Since this row number column has no header, it is given the variable name `X`.
You can suppress this by adding `row.names=F` to the write.csv function:

```{r}
write.csv(income, file="test.csv", row.names=F)
```

On european computers, excel produces (and expects) csv files to be delimited with semicolons rather then commas by default,
using the comma as a decimal separator (instead of period).
To facilitate this, R provides a pair of functions `read.csv2`/`write.csv2` that use this format. 

If you open a CSV file using the wrong function, you will only see a single column with all the values in it.
For example, if we  use `read.csv2` to  open the file we just created we get the following:

```{r, echo=F}
head = utils::head
```

```{r}
d = read.csv2("test.csv")
head(d)
```

The bottom line is: when using CSV data, always check your results, and use the 'European' version of the commands when appropriate.

Apart from writing csv files, R can also write to a native file format, which has the advantage of correctly storing all types of data (including numbers and date columns) and of storing multiple variables in one file.

For example, the following code stores the incomep and a new `x` variable in a file called `mydata.rdata`:

```{r}
x = 12
save(income, x, file="mydata.rdata")
```

Now, you can clear the data from your environment, using the Clear button in RStudio or by issuing the somewhat cryptic command `rm(list=ls())`

```{r, error=TRUE}
rm(list=ls())
head(income)
```

```{r, echo=F}
head = function(...) knitr::kable(utils::head(...))
```

And if you load the file, the variables will appear again:

```{r}
load("mydata.rdata")
head(income)
```

Note that you do not load the file into a specific variable, as the file can contain multiple variables. 
The load command will automatically create those variables with their original names. 

Subsetting data
-----

The data we have downloaded into `income` contains income series from 1900 to 2010 for a number of countries. 
We can use hard brackets `[rows, columns]` to subset this dataset, for example to select only the first 10 rows or to only select the US and Franch data. 

```{r}
income[1:10, ]
subset = income[, c("US", "France")]
head(subset)
```

A more common use case is that we want to select based on specific criteria. 
Suppose that we are now only interested in the series for the US, and France since 1945. 
We can place an expression in the rows selector to subset the data like that:

```{r}
subset = income[income$Year > 1945, c("Year", "US", "France")]
head(subset)
```

An easy 'shortcut' to subset data by selecting only certain rows is by using the `subset` function.
This function allows you to specify criteria using column names directly, i.e. without using the `dataset$` prefix:

```{r}
postwar = subset(income, Year > 1945)
head(postwar)
```


Ordering data
----

The easiest way to order data is using the `arrange` command from the `plyr` package. 
If you haven't installed `plyr` yet, you can install it with:

```{r, eval=F}
install.packages("plyr")
```

Now, you can use the arrange command which, like the subset command, allows you to use columns directly.
For example, this code orders the income dataset by ascending inequality for France:

```{r}
library(plyr)
income = arrange(income, France)
head(income)
```

So, the years with the most equal income distribution in France are in the 1980's, after which inequality was again on the increase.

You can also specify multiple columns by just adding extra arguments. 
The named argument `decreasing=TRUE` makes it easy to sort in descending order.
Finally, note that you can always using `plyr::rearrange` without first using the `library(plyr)` command.
Thus, the following code sorts by decreasing inequality in Canada and then by decreasing year:

```{r}
income = plyr::arrange(income, Canada, Year, decreasing=T)
head(income)
```

Calculating columns
----

We saw earlier that you can store the result of a calculation in a new variable.
You can also create a new column by storing the result of a calculation in a column.
For example, we could create an column for the average of US and French inequality:

```{r}
subset$average = (subset$US + subset$France) / 2
head(subset)
```

It is also possible to replace part of a column. 
For example, we can set the average to NA when the French value is lower than 0.09 like so:

```{r}
subset$average[subset$France < 0.09] = NA
head(subset)
```

What you are doing there is in fact assigning `NA` to a subset of the column, selected using the France column. 
Becoming good at R for a large part means becoming good at using the subsetting and assignment operations,
so take some time to understand and play around with this code.

Dealing with Missing Values
----

Finally, a useful function is `is.na`. This function is true when it's argument is NA (i.e., missing):

```{r}
is.na(subset$average)
```

As you can see, it is true for the thrid row and for most rows past the 23d. 
In fact, an expression lik `subset$average > 3` also returns such a vector of logical values:

```{r}
subset$US > .11
```

This result is `TRUE` for those years where the income inequality in the US is larger than .11. 
Just as we can use `subset$France < 0.09` to selectively replace certain cells, we can do so with `is.na`:

```{r}
subset$average[is.na(subset$average)] = 0
head(subset)
```

This command tells R to replace every cell in the average column where the average is missing with zero. 
Since sometimes NA values are really zero, this is quite a useful command. 
We can also use this to remove NA rows, similar to the na.omit command used earlier but more flexible.
Let's first introduce our NA's again:

```{r}
subset$average[subset$France < 0.09] = NA
head(subset)
```

And now use `!is.na` to select certain rows in the data frame (an exclamation mark (read as NOT) inverts a selection)

```{r}
subset.nomissing = subset[!is.na(subset$average), ]
head(subset.nomissing)
```

As you can see, row 49 is gone. Note the trailing comma in the subset command. Although we only want to select on rows (and not on columns), we still need to place a comma after the row selection to complete the `[rows, columns]` pattern.

In fact, you can also use selections on a whole data frame, allowing you to replace all values under a certain condition.

```{r}
subset[subset < .11] = NA
head(subset, n=10)
```

Note that here the trailing comma is not given since the selection is based on the whole data set, not just on certain rows.
Similarly, the is.na function can be used to globally replace NA values in a data frame:

```{r}
subset[is.na(subset)] = 0
head(subset)
```