## Subsetting Vectors

To subset (or extract) one or more values from a vector, provide one or more indices in square brackets. For this example, we will use the state data, which is built into R and includes data related to the 50 states of the U.S.A. Type ?state to see the included datasets. state.name is a built-in character vector containing the names of all U.S. states:

In [1]:
state.name

In [2]:
state.name[1]     #first element
state.name[13]    #13th element

What would happen if your index was out of bounds?

In [3]:
state.name[51]
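
As a quick check of the question above, the sketch below uses only the built-in state.name vector. It shows that an index past the end of a vector returns NA rather than an error, and that index 0 selects nothing.

```r
# state.name is one of R's built-in datasets and has exactly 50 elements
length(state.name)      # 50

# An index past the end returns NA instead of raising an error
state.name[51]          # NA
is.na(state.name[51])   # TRUE

# Index 0 selects nothing and returns an empty character vector
state.name[0]           # character(0)
```

Because out-of-bounds indexing fails silently, it can be worth comparing an index against length() before relying on the result.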

You can use the colon operator (:) to create a vector of consecutive integers.

In [4]:
state.name[1:5]     #first 5 elements
state.name[6:20]    #elements 6-20 
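
The colon operator builds an ordinary integer vector, so the indices can be stored in a variable and reused. A minimal sketch, where the name idx is introduced here only for illustration:

```r
# 'idx' is a hypothetical variable name used only for this example
idx <- 1:5
idx                                       # 1 2 3 4 5

# The colon shorthand produces the same integer vector as typing it out
identical(idx, c(1L, 2L, 3L, 4L, 5L))     # TRUE

# Subsetting with the stored indices matches subsetting with 1:5 directly
identical(state.name[idx], state.name[1:5])   # TRUE
```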

If the numbers are not consecutive, you must use the c() function:

In [6]:
state.name[c(1, 10, 20)]

We can also repeat the indices to create an object with more elements than the original one:

In [5]:
state.name[c(1, 2, 3, 2, 1, 3)]

> NOTE : R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.

## Conditional Subsetting
Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:

The example below creates a vector containing the first 5 states ("Alabama", "Alaska", "Arizona", "Arkansas", "California"). The second line then subsets it with a logical vector, keeping only the elements at positions marked TRUE:

In [7]:
five_states <- state.name[1:5]
five_states[c(TRUE, FALSE, TRUE, FALSE, TRUE)]

Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. state.area is a vector of state areas in square miles. We can use the < operator to return a logical vector with TRUE for the indices that meet the condition:

In [8]:
state.area < 10000

In [0]:
state.name[state.area < 10000]

The first expression gives us a logical vector of length 50, where TRUE represents those states with areas less than 10,000 square miles. The second expression subsets state.name to include only those names where the value is TRUE.
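
A logical vector can also be stored and reused. The sketch below assumes only the built-in state.area and state.name vectors. The name small is introduced here for illustration. Because TRUE counts as 1 in arithmetic, sum() gives the number of matching elements.

```r
# 'small' is a hypothetical name used only for this illustration
small <- state.area < 10000

length(small)        # one TRUE/FALSE value per state
sum(small)           # TRUE counts as 1, so this is the number of matches
state.name[small]    # names of the states that meet the condition
```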

You can also specify character values. state.region gives the region that each state belongs to:

In [0]:
state.region == "Northeast"

In [0]:
state.name[state.region == "Northeast"]

Again, the first expression returns a logical vector of length 50 that is TRUE wherever the region is the Northeast. The second expression uses that vector to subset state.name, returning only the states where the value is TRUE.

Sometimes you need to do multiple logical tests (think Boolean logic). You can combine multiple tests using | (at least one of the conditions is true, OR) or & (both conditions are true, AND). Use help(Logic) to read the help file.

In [0]:
state.name[state.area < 10000 | state.region == "Northeast"]

In [0]:
state.name[state.area < 10000 & state.region == "Northeast"]

The first result includes every state with an area under 10,000 sq. mi. as well as every state in the Northeast. New York, Pennsylvania, and Maine have areas greater than 10,000 square miles, but are in the Northeastern U.S. Delaware and Hawaii are not in the Northeast, but each has an area under 10,000 square miles. The second result includes only states that are in the Northeast and have an area under 10,000 sq. mi.
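
To see how | and & combine tests element by element, here is a minimal sketch on made-up data. The vectors area and region and their values are hypothetical and are not taken from the state datasets.

```r
# Hypothetical toy data: four made-up areas and regions
area   <- c(5000, 50000, 6000, 40000)
region <- c("Northeast", "Northeast", "West", "West")

small     <- area < 10000            # TRUE FALSE  TRUE FALSE
northeast <- region == "Northeast"   # TRUE  TRUE FALSE FALSE

small | northeast   # TRUE  TRUE  TRUE FALSE  (at least one test is TRUE)
small & northeast   # TRUE FALSE FALSE FALSE  (both tests are TRUE)
```

The OR result is never shorter than the AND result, since any element that passes both tests also passes at least one.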

R contains a number of operators you can use to compare values. Use help(Comparison) to read the R help file. Note that two equal signs (==) are used for evaluating equality (because one equals sign (=) is used for assigning variables).

A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious. The %in% operator tests, for each element of the vector on its left, whether that element appears anywhere in the vector on its right:

In [0]:
west_coast <- c("California", "Oregon", "Washington")
state.name[state.name == "California" | state.name == "Oregon" | state.name == "Washington"]

In [0]:
state.name %in% west_coast

In [0]:
state.name[state.name %in% west_coast]
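
The two approaches can be compared directly. This sketch stores each logical vector under a name chosen here for illustration (by_or and by_in) and confirms that they match.

```r
west_coast <- c("California", "Oregon", "Washington")

# 'by_or' and 'by_in' are hypothetical names used only for this comparison
by_or <- state.name == "California" | state.name == "Oregon" | state.name == "Washington"
by_in <- state.name %in% west_coast

identical(by_or, by_in)   # TRUE: both tests flag the same elements
sum(by_in)                # 3: one match per West Coast state
```

With %in%, adding another state only means extending west_coast; the chained version needs another == test for each new value.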

## Missing Data

As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA. R functions have special actions when they encounter NA.

When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This feature makes it harder to overlook the cases where you are dealing with missing data. As we saw above, you can add the argument na.rm=TRUE to calculate the result while ignoring the missing values.

In [0]:
rooms <- c(2, 1, 1, NA, 4)
mean(rooms)

In [0]:
max(rooms)

In [0]:
mean(rooms, na.rm = TRUE)

In [0]:
max(rooms, na.rm = TRUE)

If your data include missing values, you may want to become familiar with the functions is.na(), na.omit(), and complete.cases(). See below for examples.

In [0]:
## Use any() to check if any values are missing
any(is.na(rooms))

In [0]:
## Use table() to tell you how many are missing vs. not missing
table(is.na(rooms))

In [0]:
## Identify those elements that are not missing values.
complete.cases(rooms)

In [0]:
## Identify those elements that are missing values.
is.na(rooms)

In [0]:
## Extract those elements that are not missing values.
rooms[complete.cases(rooms)]

You can also use !is.na(rooms), which for a vector gives the same result as complete.cases(rooms). The exclamation mark indicates logical negation.

In [0]:
!c(TRUE, FALSE)
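
The sketch below puts the three functions mentioned above side by side on the same rooms vector. It assumes a plain numeric vector. Here as.vector() is used only to drop the bookkeeping attributes that na.omit() attaches to its result.

```r
rooms <- c(2, 1, 1, NA, 4)

# For a vector, both tests mark the same elements as present
all(!is.na(rooms) == complete.cases(rooms))   # TRUE

# Two ways to keep only the non-missing values
rooms[!is.na(rooms)]          # 2 1 1 4
as.vector(na.omit(rooms))     # 2 1 1 4
```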

How you deal with missing data in your analysis is a decision you will have to make: do you remove the missing values entirely? Do you replace them with zeros? That will depend on your own methodological questions.
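
To illustrate why the choice matters, the sketch below compares the two options on the rooms vector. The name rooms_zero is introduced here for illustration, and neither option is presented as the correct one.

```r
rooms <- c(2, 1, 1, NA, 4)

# Option 1: ignore the missing value; the mean is taken over 4 values
mean(rooms, na.rm = TRUE)    # 2

# Option 2: replace the missing value with zero; the mean is taken over 5 values
rooms_zero <- rooms                     # hypothetical copy, so rooms is unchanged
rooms_zero[is.na(rooms_zero)] <- 0
mean(rooms_zero)             # 1.6
```

Replacing NA with zero pulls the mean down because it treats an unknown value as a known count of zero rooms.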

## Key Points
- Use the assignment operator <- to assign values to objects. You can then manipulate those objects in R.
- R contains a number of functions you can use to do something with your data. Functions automate more complicated sets of commands. Many functions are predefined, or can be made available by importing R packages.
- A vector is a sequence of elements of the same type. All data in a vector must be of the same type: character, numeric (or double), integer, or logical. Create vectors with c(). Use [ ] to subset values from vectors.