Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
e12ac25
commit d2335b9
Showing
1 changed file
with
250 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,250 @@ | ||
--- | ||
title: "Data management in `R`: Session 2" | ||
subtitle: "USC Security and Political Economy Lab" | ||
author: "Therese Anders" | ||
output: | ||
pdf_document: | ||
number_sections: yes | ||
--- | ||
|
||
# Working with data frames in `R` | ||
## Importing data | ||
Most data formats we commonly use are not native to `R` and need to be imported. Luckily, there is a variety of packages available to import just about any non-native format. One of the essential libraries is called `foreign` and includes functions to import .csv, .dta (Stata), .dat, ,.sav (SPSS), etc. The `foreign` package is pre-installed. However, we need to tell `R` that we want to use functions from this package by calling it into the working environment using the `library()` command. | ||
```{r,warning = F} | ||
library(foreign) | ||
``` | ||
|
||
In this example, we will use the data from Imai (2017), that is saved in a .csv file and can be accessed online http://qss.princeton.press/student-files/INTRO/UNpop.csv. Download and save this file on your machine. When reading the data, remember that you need to either specify the complete file name or change your working directory using `setwd()` to the folder in which you saved the data. | ||
```{r} | ||
mydata <- read.csv("UNpop.csv") | ||
class(mydata) | ||
``` | ||
|
||
## Dimensions of a data frame | ||
Let's find out what this data looks like. First, use the `str()` function to explore the variable names and which data class they are stored in. Note: `int` stands for `integer` and is a special case of the class `numeric`. | ||
```{r} | ||
str(mydata) | ||
``` | ||
|
||
|
||
If we are only interested in what the variables are called, we can use the `names()` function. | ||
```{r} | ||
names(mydata) | ||
``` | ||
|
||
As we saw in the last session, we can alter the names of vectors by using the `names()` function and indexing. Because data frames are essentially just combinations of vectors, we can do the same for variable names inside data frames. Suppose we didn't like the `.` in a variable name and wanted to change this to an underscore. | ||
```{r} | ||
names(mydata)[2] <- "world_pop" | ||
names(mydata) | ||
``` | ||
|
||
We can use the `summary()` function to get a first look at the data using measures of location. | ||
```{r} | ||
summary(mydata) | ||
``` | ||
|
||
A data frame has two dimensions: rows and columns. | ||
```{r} | ||
nrow(mydata) # Number of rows | ||
ncol(mydata) # Number of columns | ||
dim(mydata) # Rows first then columns. | ||
``` | ||
|
||
## Accessing elements of a data frame | ||
As a rule, whenever we use two-dimensional indexing in `R`, the order is: `[row,column]`. To access the first row of the data frame, we specify the row we want to see and leave the column slot following the comma empty. | ||
```{r} | ||
mydata[1,] | ||
``` | ||
|
||
We can use the concatenate function `c()` to access multiple rows (or colums) at once. Below we print out the first and second row of the dataframe. | ||
```{r} | ||
mydata[c(1,2),] | ||
``` | ||
|
||
If we try to access a data point that is out of bounds, `R` returns the value `NULL`. Here, there is no third column! | ||
```{r} | ||
mydata[3,3] | ||
``` | ||
|
||
|
||
**Exercise 1** Access the element of the data frame `mydata` that is stored in row 1, column 1. | ||
```{r, echo = F} | ||
mydata[1,1] | ||
``` | ||
|
||
**Exercise 2** Access the element of the data frame `mydata` that is stored in column 2, row 3. | ||
```{r, echo=FALSE} | ||
mydata[3,2] | ||
``` | ||
|
||
### The `$` operator | ||
|
||
The `$` operator in `R` is used to specify a variable within a data frame. This is an alternative to indexing. | ||
```{r} | ||
mydata$year | ||
mydata$world_pop | ||
``` | ||
|
||
**Exercise 3**: How would you access all elements of the variable `year` (first variable) using indexing rather than the `$` operator? | ||
```{r, echo = F} | ||
mydata[,1] | ||
mydata[,"year"] | ||
``` | ||
|
||
**Exercise 4**: Print out every second element from the variable `world_pop` using indexing methods and the sequence function `seq()`. | ||
```{r, echo = F} | ||
mydata$world_pop[seq(1, length(mydata$world_pop), 2)] | ||
``` | ||
|
||
**Exercise 5**: What are two ways to find the mean value of the variable `world_pop` using indexing (i.e. not using the `$` operator)? | ||
```{r, echo = F} | ||
mean(mydata[,2]) | ||
mean(mydata[,"world_pop"]) | ||
``` | ||
|
||
**Exercise 6**: How would you find the maximum world population value using the `$` operator? | ||
```{r, echo = F} | ||
max(mydata$world_pop) | ||
``` | ||
|
||
**Exercise 7**: Print the year that corresponds to the maximum world population value using the `$` operator and indexing! | ||
```{r, echo = F} | ||
mydata$year[mydata$world_pop == max(mydata$world_pop)] | ||
``` | ||
|
||
## NAs in `R` | ||
`NA` is how `R` denotes missing values. For certain functions, `NA`s cause problems. | ||
```{r} | ||
vec <- c(4, 1, 2, NA, 3) | ||
mean(vec) #Result is NA! | ||
sum(vec) #Result is NA! | ||
``` | ||
|
||
We can tell `R` to remove the NA and execute the function on the remainder of the data. | ||
```{r} | ||
mean(vec, na.rm = T) | ||
sum(vec, na.rm = T) | ||
``` | ||
|
||
|
||
## Adding observations | ||
First, lets add another observation to the data. Suppose we wanted to add an observation for the year 2020, that for right now will be a missing value. We can use the same operations we used for vectors to add data. Here, we will will use the `rbind()` function to do so. `rbind()` stands for "row bind." Save the output in a new data frame! | ||
```{r} | ||
obs <- c(2020, NA) | ||
obs | ||
mydata_new <- rbind(mydata, obs) | ||
dim(mydata_new) | ||
``` | ||
|
||
We can also create new variables that use information from the existing data. The population value is expressed in thousands. To express the world population in its original unit of measurement, we multiply each value by the scalar 1000 and store it in a new value called `world_pop_og`. By using the `$` operator, we can directly assign the new variable to the data frame `mydata_new`. | ||
```{r} | ||
mydata_new$world_pop_og <- mydata_new$world_pop * 1000 | ||
head(mydata_new, 3) #prints out the first 3 rows of they data frame | ||
``` | ||
|
||
We can use indexing and logical expressions to compute the population growth rate relative to the value in 1950. The general formula for computing a growth rate is | ||
|
||
$$growth = (new - old)/old * 100.$$ | ||
|
||
To make the code more legible, I will first store the 1950 value in a separate object called `pop1950`. | ||
```{r} | ||
pop1950 <- mydata_new$world_pop[mydata_new$year == 1950] | ||
mydata_new$world_pop_growth1950 <- (mydata_new$world_pop - pop1950)/pop1950 * 100 | ||
``` | ||
|
||
## Saving data | ||
Suppose we wanted to save this newly created data frame. We have multiple options to do so. If we wanted to save it as a native `.RData` format, we would run the following command. | ||
```{r} | ||
# Make sure you specified the right working directory! | ||
save(mydata_new, file = "mydata_new.RData") | ||
``` | ||
|
||
Most of the time, however, we would want to save our data in formats that can be read by other programs as well. .csv is an obvious choice. After saving the new file, check the folder that is your working directory to see whether the newly saved file shows up there! | ||
```{r} | ||
write.csv(mydata_new, file = "mydata_new.csv") | ||
``` | ||
|
||
# (Very basic) data visualization | ||
Today, we will be covering some basics of data visualization in `R` using the native plotting functions. For more advanced data visualization functions, see the `ggplot2` package and the related material for a two-session introductory workshop on data visualization https://github.com/thereseanders/Workshop-Intro-to-ggplot2. | ||
|
||
We will work with data on US states' population, area, and population density values in 2015, derived from https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density. First, read in the data, and take a look at it using the `head()` function. | ||
|
||
```{r} | ||
dat <- read.csv("wiki_us_pop.csv") | ||
head(dat) | ||
``` | ||
|
||
## Basic graphical summaries of data | ||
|
||
| Type |Operator | | ||
|:------|:-----| | ||
|Histogram | `hist()` | | ||
|Stem and leaf plot | `stem()` | | ||
|Boxplot | `boxplot()` | | ||
|Kernel density plot | `plot(density())` | | ||
|Basic scatterplot | `plot()`| | ||
|
||
### Boxplot of population density | ||
First, we get a overview of the population density of US states by using the `boxplot()` function. It seems that there are quite a few outliers when it comes to population density! | ||
```{r, fig.width=3, fig.height = 3} | ||
boxplot(dat$pop_dens) | ||
``` | ||
|
||
Suppose we wanted to know whether larger states on average are more or less densely populated than smaller states. Let's first look at the distribution of the land area (in square km). | ||
```{r} | ||
summary(dat$area) #mean 182949, median 139578 | ||
``` | ||
|
||
Create a new dummy variable that codes whether a state is smaller or larger equal than the **median** state size. | ||
```{r} | ||
median(dat$area) | ||
dat$largestate <- ifelse(dat$area < median(dat$area), 1, 0) | ||
head(dat$largestate) | ||
table(dat$largestate) # We split the observations exactly in half. | ||
``` | ||
|
||
|
||
We can display two indicators in the same boxplot. We can use this feature to answer the question whether larger states are on average more or less dense than smaller states. | ||
```{r, fig.width = 3, fig.height= 3} | ||
boxplot(dat$pop_dens ~ dat$largestate) | ||
``` | ||
|
||
|
||
### Histogram of population density | ||
```{r, fig.width=4, fig.height = 3} | ||
hist(dat$pop_dens) | ||
``` | ||
|
||
### Density plot of population size | ||
```{r, fig.width = 4, fig.height = 3} | ||
plot(density(dat$pop)) | ||
``` | ||
|
||
## Basic scatter plots | ||
How does population size vary with area? Are the correlated at all? | ||
```{r, warning = F, message = F , fig.width = 4, fig.height = 4} | ||
plot(dat$pop,dat$area) | ||
``` | ||
|
||
### Basic graphic options | ||
Aside from arguably not being very informative, the plot above is not very pretty. Lets give it titles, use color and shapes! | ||
|
||
```{r, warning = F, message = F , fig.width = 4, fig.height = 4} | ||
plot(dat$pop, dat$area, | ||
main = "US States (2015)", #Adding a main title. | ||
xlab = "Population", #Adding a x-axis title. | ||
ylab = "Area in sq km", #Adding a y-axis title. | ||
col = "tomato", #Changing the color of the data points. | ||
pch = 18) #Changing the shape | ||
``` | ||
|
||
Yeah, ok. Its not much prettier (especially the labeling on the axes), but you get the point... | ||
|
||
A few additional notes on graphical options: | ||
|
||
* `R` can display any color in the RBG or HEX system. However it also has a ton of colors that you can just refer to by name, see http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf. | ||
* Same with the shapes and line types, see http://www.cookbook-r.com/Graphs/Shapes_and_line_types/. | ||
* `R` colors in all their glory: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf. | ||
|
||
# References {-} | ||
Imai, Kosuke (2017): *Quantitative Social Science. An Introduction*. Princeton and Oxford: Princeton University Press. |