## R Prep Minicourse
### Week 9: Matrices, Factors, and Data Frames

**Credits:** [DataCamp's Introduction to R Course](https://campus.datacamp.com/courses/free-introduction-to-r)

#### Recap

In [1]:
# Poker winnings from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)

# Assign days as names of poker_vector
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Total winnings with poker
total_poker <- sum(poker_vector)
total_poker

In [2]:
# Check which days you won money
poker_wins <- poker_vector[poker_vector > 0]
poker_wins

In [3]:
# How much do you earn on average on winning days?
mean(poker_wins)

### Matrices

In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.

You can construct a matrix in R with the `matrix()` function. Consider the following example:

`matrix(1:9, byrow = TRUE, nrow = 3)`

In the `matrix()` function:

- The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use `1:9` which is a shortcut for `c(1, 2, 3, 4, 5, 6, 7, 8, 9)`.
- The argument `byrow` indicates that the matrix is filled by rows. To fill the matrix by columns instead, set `byrow = FALSE`, which is also the default.
- The third argument `nrow` indicates that the matrix should have three rows.

In [4]:
matrix(1:9, byrow = TRUE, nrow = 3)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
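To make the effect of `byrow` concrete, here is a small sketch that builds the same nine values both ways. The names `by_row` and `by_col` are illustrative and not part of the original course material.

```r
# Same nine values, filled row by row and column by column
by_row <- matrix(1:9, byrow = TRUE, nrow = 3)
by_col <- matrix(1:9, byrow = FALSE, nrow = 3)  # byrow = FALSE is the default

by_row[1, ]  # first row when filling by rows:    1 2 3
by_col[1, ]  # first row when filling by columns: 1 4 7

# Filling by rows gives the transpose of filling by columns
identical(by_row, t(by_col))
```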


Similar to vectors, you can add names for the rows and the columns of a matrix:

`rownames(my_matrix) <- row_names_vector`

`colnames(my_matrix) <- col_names_vector`

In [5]:
# Box office Star Wars (in millions!)
new_hope <- c(461.0, 314.4)
empire_strikes <- c(290.5, 247.9)
return_jedi <- c(309.3, 165.8)

# Construct matrix
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)

# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

# Name the columns with region
colnames(star_wars_matrix) <- region

# Name the rows with titles
rownames(star_wars_matrix) <- titles

# Print out star_wars_matrix
print(star_wars_matrix)

                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8


In [6]:
# This also works:

box_office <- c(461.0, 314.4, 290.5, 247.9, 309.3, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"), c("US", "non-US")))
print(star_wars_matrix)

                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8


In R, the function `rowSums()` conveniently calculates the totals for each row of a matrix. This function creates a new vector:

In [7]:
# Calculate worldwide box office figures
worldwide_vector <- rowSums(star_wars_matrix)

print(worldwide_vector)

             A New Hope The Empire Strikes Back      Return of the Jedi 
                  775.4                   538.4                   475.1 


You can add a column or multiple columns to a matrix with the `cbind()` function, which merges matrices and/or vectors together by column. For example:

`big_matrix <- cbind(matrix1, matrix2, vector1 ...)`



In [8]:
# Bind the new variable worldwide_vector as a column to star_wars_matrix
all_wars_matrix <- cbind(star_wars_matrix, Total = worldwide_vector)

print(all_wars_matrix)

                           US non-US Total
A New Hope              461.0  314.4 775.4
The Empire Strikes Back 290.5  247.9 538.4
Return of the Jedi      309.3  165.8 475.1


Whereas `cbind()` can paste columns together, `rbind()` does the same thing, but with rows.

In [9]:
box_office2 <- c(474.5, 552.5, 310.7, 338.7, 380.3, 468.5)
star_wars_matrix2 <- matrix(box_office2, nrow = 3, byrow = TRUE,
                           dimnames = list(c("The Phantom Menace", "Attack of the Clones", "Revenge of the sith"), c("US", "non-US")))

# Combine both Star Wars trilogies in one matrix
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)

print(all_wars_matrix)

                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
The Phantom Menace      474.5  552.5
Attack of the Clones    310.7  338.7
Revenge of the sith     380.3  468.5


Similarly, we also have a `colSums()` function.

In [10]:
# Total revenue for US and non-US
total_revenue_vector <- colSums(all_wars_matrix)

print(total_revenue_vector)

    US non-US 
2226.3 2087.8 


Similar to vectors, you can use the square brackets `[  ]` to select one or multiple elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. For example:

- `my_matrix[1,2]` selects the element at the first row and second column.
- `my_matrix[1:3,2:4]` results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.

If you want to select all elements of a row or a column, no number is needed before or after the comma, respectively:

- `my_matrix[,1]` selects all elements of the first column.
- `my_matrix[1,]` selects all elements of the first row.
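When a matrix has row and column names, the same bracket syntax also accepts names instead of positions. The sketch below rebuilds the first trilogy matrix under the hypothetical name `sw` so that it runs on its own. It also shows `drop = FALSE`, which keeps a single selected column as a matrix rather than simplifying it to a vector.

```r
box_office <- c(461.0, 314.4, 290.5, 247.9, 309.3, 165.8)
# `sw` is an illustrative name, not one used elsewhere in this notebook
sw <- matrix(box_office, nrow = 3, byrow = TRUE,
             dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                             c("US", "non-US")))

# Single element, selected by row name and column name
sw["A New Hope", "US"]

# Whole column by name: the result is simplified to a named vector
non_us <- sw[, "non-US"]

# drop = FALSE keeps the result as a 3 x 1 matrix
us_col <- sw[, "US", drop = FALSE]
dim(us_col)
```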

In [11]:
# Select the non-US revenue for all movies
non_us_all <- all_wars_matrix[,2]

# Average non-US revenue
print(mean(non_us_all))

# Select the US revenue for first two movies
us_some <- all_wars_matrix[1:2, 1]

# Average US revenue for first two movies
print(mean(us_some))

[1] 347.9667
[1] 375.75


In [12]:
# You can also use arithmetic operators with matrices.
# They will apply to every element of the matrix.

# Assume the ticket price was $5. Estimate the visitors:
visitors <- all_wars_matrix / 5

# Print the estimate to the console
print(visitors, digits=2)

                        US non-US
A New Hope              92     63
The Empire Strikes Back 58     50
Return of the Jedi      62     33
The Phantom Menace      95    110
Attack of the Clones    62     68
Revenge of the sith     76     94


In [13]:
# Or if you have more granular information on the ticket prices...
ticket_prices_matrix <- matrix(c(5, 5, 6, 6, 7, 7, 4, 4, 4.5, 4.5, 4.9, 4.9), nrow = 6, byrow = TRUE)

visitors2 <- all_wars_matrix / ticket_prices_matrix
print(visitors2, digits=2)

                         US non-US
A New Hope               92     63
The Empire Strikes Back  48     41
Return of the Jedi       44     24
The Phantom Menace      119    138
Attack of the Clones     69     75
Revenge of the sith      78     96
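Because arithmetic operators work element by element, `*` between two matrices is not matrix multiplication in the linear-algebra sense; R uses `%*%` for that. A minimal sketch with two small illustrative matrices (`a` and `b` are made-up names and values):

```r
a <- matrix(1:4, nrow = 2)            # columns are (1, 2) and (3, 4)
b <- matrix(c(2, 0, 0, 2), nrow = 2)  # two times the identity matrix

# Element-wise product: each entry of a times the matching entry of b
elementwise <- a * b

# Matrix product: rows of a combined with columns of b
product <- a %*% b

print(elementwise)
print(product)
```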


##### Summary:

- `matrix(1:9, byrow = TRUE, nrow = 3)`
- `colnames`, `rownames`
- `colSums`, `rowSums`
- `cbind`, `rbind`
- `[row_index, col_index]`

### Factors

The term factor refers to a statistical data type used to store categorical variables. A categorical variable takes one of a limited number of categories, whereas a continuous variable can take any of an infinite number of values.

To create factors in R, you use the function `factor()`. The first thing you have to do is create a vector that contains all the observations that belong to a limited number of categories. The function `factor()` will then encode the vector as a factor:


In [14]:
# Gender vector
gender_vector <- c("Male", "Female", "Female", "Male", "Non-Binary")

# Convert gender_vector to a factor
factor_gender_vector <- factor(gender_vector)

# Print out factor_gender_vector
print(factor_gender_vector)

[1] Male       Female     Female     Male       Non-Binary
Levels: Female Male Non-Binary


Sometimes, you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function `levels()`:

`levels(factor_vector) <- c("name1", "name2",...)`

In [15]:
# Code to build factor_survey_vector
survey_vector <- c("M", "F", "F", "M", "NB")
factor_survey_vector <- factor(survey_vector)

# Rename the levels of factor_survey_vector.
# The new names must follow the existing level order, which is alphabetical: "F", "M", "NB"
levels(factor_survey_vector) <- c("Female", "Male", "Non-Binary")

factor_survey_vector
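Assigning to `levels()` relies on knowing the existing level order. As a hedged alternative, the sketch below first inspects the default order and then spells out the mapping from codes to labels with the `levels` and `labels` arguments of `factor()`, so nothing depends on alphabetical sorting. The variable name `default_levels` is illustrative.

```r
survey_vector <- c("M", "F", "F", "M", "NB")

# By default, the levels are the sorted unique values
default_levels <- levels(factor(survey_vector))
default_levels

# Explicit mapping: each entry in `levels` is paired with the label at the same position
factor_survey_vector <- factor(survey_vector,
                               levels = c("F", "M", "NB"),
                               labels = c("Female", "Male", "Non-Binary"))
factor_survey_vector
```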

`summary()` gives you a quick overview of the contents of a variable:

In [16]:
# Generate summary for survey_vector
summary(survey_vector)

# Generate summary for factor_survey_vector
summary(factor_survey_vector)

   Length     Class      Mode 
        5 character character 
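The output above is the summary of the character vector. For a factor, `summary()` instead counts the observations in each level. A small self-contained sketch (the name `counts` is illustrative):

```r
survey_vector <- c("M", "F", "F", "M", "NB")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male", "Non-Binary")

# For a factor, summary() returns a named vector of counts per level
counts <- summary(factor_survey_vector)
counts
```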

##### Summary:

- `factor_vector <- factor(vector)`
- `levels(factor_vector) <- c("name1", "name2",...)`
- `summary()`

### Data Frames

A data frame has the variables of a data set as columns and the observations as rows. Contrary to matrices, data frames can hold elements of different types.

Let's import one built-in data frame to use as an example. The function `data()` allows you to load a data set. `?x`, which is equivalent to `help(x)`, shows the documentation associated with the object `x` (if it exists).

In [17]:
data(mtcars)

?mtcars

mtcars {datasets}    R Documentation

[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu.in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (1000 lbs)
[, 7]  qsec  1/4 mile time
[, 8]  vs    V/S
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears


The function `head()` enables you to show the first observations of a data frame. Similarly, the function `tail()` prints out the last observations in your data set.

In [18]:
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
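Both `head()` and `tail()` show six rows by default and accept an `n` argument to change that. A short sketch, with `first_three` and `last_two` as illustrative names:

```r
data(mtcars)

# First three and last two observations
first_three <- head(mtcars, n = 3)
last_two <- tail(mtcars, n = 2)

rownames(first_three)
rownames(last_two)
```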


Another method that is often used to get a rapid overview of your data is the function `str()`. The function `str()` shows you the structure of your data set. For a data frame, it tells you:

- The total number of observations (e.g. 32 car types)
- The total number of variables (e.g. 11 car features)
- A full list of the variable names (e.g. `mpg`, `cyl`)
- The data type of each variable (e.g. `num`)
- The first observations



Applying the `str()` function will often be the first thing that you do when receiving a new data set or data frame. It is a great way to get more insight into your data set before diving into the real analysis.

In [19]:
str(mtcars)

'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...


Now, let's imagine you want to construct a data frame that describes the main characteristics of the eight planets in our solar system. According to your good friend Buzz, the main features of a planet are:

- The type of planet (Terrestrial or Gas Giant).
- The planet's diameter relative to the diameter of the Earth.
- The planet's rotation period about its own axis relative to that of the Earth (negative values indicate retrograde rotation).
- If the planet has rings or not (TRUE or FALSE).

You construct a data frame with the `data.frame()` function. As arguments, you pass vectors: they will become the different columns of your data frame. Because every column has the same length, the vectors you pass should also have the same length. But don't forget that it is possible (and likely) that they contain different types of data.

In [20]:
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)
planets_df

     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
3   Earth Terrestrial planet    1.000     1.00 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
5 Jupiter          Gas giant   11.209     0.41  TRUE
6  Saturn          Gas giant    9.449     0.43  TRUE
7  Uranus          Gas giant    4.007    -0.72  TRUE
8 Neptune          Gas giant    3.883     0.67  TRUE


In [21]:
str(planets_df)

'data.frame':	8 obs. of  5 variables:
 $ name    : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
 $ type    : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
 $ diameter: num  0.382 0.949 1 0.532 11.209 ...
 $ rotation: num  58.64 -243.02 1 1.03 0.41 ...
 $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
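The `Factor` columns in the output above reflect the behavior of R versions before 4.0.0, where `data.frame()` converted character vectors to factors by default. From R 4.0.0 onward the default is `stringsAsFactors = FALSE`, so the same code yields `chr` columns. One way to get the same result on any version is to set the argument explicitly and convert only the truly categorical column, as in this sketch (`df_chr` is an illustrative name, and only three planets are used to keep it short):

```r
name <- c("Mercury", "Venus", "Earth")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet")

# Keep character vectors as plain strings, regardless of R version
df_chr <- data.frame(name, type, stringsAsFactors = FALSE)

# Convert only the categorical column to a factor
df_chr$type <- factor(df_chr$type)

str(df_chr)
```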


Similar to vectors and matrices, you select elements from a data frame with the help of square brackets `[  ]`.

In [22]:
# Print out diameter of Mercury (row 1, column 3)
planets_df[1,3]

# Print out data for Mars (entire fourth row)
planets_df[4,]

  name               type diameter rotation rings
4 Mars Terrestrial planet    0.532     1.03 FALSE


Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.

In [23]:
# Select first 5 values of diameter column
planets_df[1:5, "diameter"]

You will often want to select an entire column, namely one specific variable from a data frame. If your columns have names, you can use the `$` sign:

In [24]:
# Select the rings variable from planets_df
rings_vector <- planets_df$rings
  
# Print out rings_vector
rings_vector

# Which planets have rings?
planets_df[rings_vector, ]

     name      type diameter rotation rings
5 Jupiter Gas giant   11.209     0.41  TRUE
6  Saturn Gas giant    9.449     0.43  TRUE
7  Uranus Gas giant    4.007    -0.72  TRUE
8 Neptune Gas giant    3.883     0.67  TRUE


Now, let us move up one level and use the function `subset()`. You can think of `subset()` as a shortcut that does exactly what the previous cell did with a logical vector inside square brackets.

`subset(my_df, subset = some_condition)`

The first argument of `subset()` specifies the data set for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset.

In [25]:
subset(planets_df, subset = rings)

     name      type diameter rotation rings
5 Jupiter Gas giant   11.209     0.41  TRUE
6  Saturn Gas giant    9.449     0.43  TRUE
7  Uranus Gas giant    4.007    -0.72  TRUE
8 Neptune Gas giant    3.883     0.67  TRUE


In [26]:
# Select planets with diameter < 1
subset(planets_df, diameter < 1)

     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
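Conditions in `subset()` can be combined with `&` and `|`, and the `select` argument restricts the columns that are returned. The sketch below uses a reduced, illustrative data frame named `planets` (four of the eight planets) and shows the equivalent square-bracket form next to it.

```r
# Small illustrative data frame; `planets` is not the planets_df built earlier
planets <- data.frame(
  name = c("Mercury", "Earth", "Jupiter", "Neptune"),
  diameter = c(0.382, 1, 11.209, 3.883),
  rings = c(FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Ringed planets with a diameter above 5, keeping two columns
big_ringed <- subset(planets, subset = rings & diameter > 5,
                     select = c(name, diameter))

# The same selection written with square brackets
big_ringed2 <- planets[planets$rings & planets$diameter > 5,
                       c("name", "diameter")]

big_ringed
```

Inside `subset()` the column names can be used directly, whereas the bracket form needs the `planets$` prefix.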


##### Summary:

- `data()`, `?`, `help()`
- `head()`, `tail()`, `str()`
- `data.frame(vector1, vector2, vector3, ...)`
- `df[row_index, col_index]`, `df[row_index, col_name]`, `df$col_name`
- `subset()`

### Lists

A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.

To construct a list you use the function `list()`:

`my_list <- list(comp1, comp2 ...)`

In [27]:
# Vector with numerics from 1 up to 10
my_vector <- 1:10 

# Matrix with numerics from 1 up to 9
my_matrix <- matrix(1:9, ncol = 3)

# First 10 rows of the built-in data frame mtcars
my_df <- mtcars[1:10,]

# Construct list with these different elements:
my_list <- list(my_vector, my_matrix, my_df)

It is helpful to give names to the components of your list. You can do it with either the `names()` function or in the following way:

`my_list <- list(name1 = your_comp1, name2 = your_comp2)`

In [28]:
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)

# which is equivalent to...
my_list <- list(my_vector, my_matrix, my_df)
names(my_list) <- c("vec", "mat", "df")

One way to select a component is using the numbered position of that component in double square brackets `[[  ]]`. For example, to "grab" the first component of `my_list` you type

`my_list[[1]]`

It is important to remember that to select elements from vectors, you use single square brackets `[  ]`, and from lists, double square brackets `[[  ]]`. Don't forget!
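Single brackets do work on a list, but they return a shorter list rather than the component itself. The following sketch makes the difference visible; the small list and the names `sub_list` and `component` are illustrative.

```r
my_list <- list(vec = 1:10, mat = matrix(1:9, ncol = 3))

sub_list <- my_list[1]     # still a list, containing one component
component <- my_list[[1]]  # the integer vector itself

class(sub_list)
class(component)

# Arithmetic works on the extracted component, not on the one-element list
sum(component)
```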

You can also refer to the names of the components, with `[[  ]]` or with the `$` sign.

`my_list[["df"]]`

`my_list$df`

In [29]:
# Print out the matrix
my_list$mat

# Print the second row of the data frame
my_list$df[2,]

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4


To conveniently add elements to lists you can use the `c()` function:

`ext_list <- c(my_list, my_name = new_value)`

In [30]:
my_new_list <- c(my_list, logical = TRUE)

my_new_list[[4]]
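One caveat when extending a list with `c()`: a value of length one is appended as a single component, but a longer vector is split into one component per element. Wrapping the new value in `list()` keeps it together. A minimal sketch, where the `scores` values and the names `with_flag`, `spliced` and `wrapped` are made up for illustration:

```r
my_list <- list(vec = 1:3)

# A length-one value becomes a single new component
with_flag <- c(my_list, logical = TRUE)
length(with_flag)

# A longer vector is spliced in element by element
spliced <- c(my_list, scores = c(10, 20))
length(spliced)
names(spliced)

# Wrapping the value in list() adds it as one component
wrapped <- c(my_list, list(scores = c(10, 20)))
length(wrapped)
```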

##### Summary:

- `my_list <- list(name1 = your_comp1, name2 = your_comp2)`
- `my_list[[1]]`, `my_list[["name1"]]`, `my_list$name1`
- `ext_list <- c(my_list, my_name = new_value)`

Congratulations! At this point in the course you are already familiar with:

- **Vectors** (one dimensional array): can hold numeric, character or logical values. The elements in a vector all have the same data type.
- **Matrices** (two dimensional array): can hold numeric, character or logical values. The elements in a matrix all have the same data type.
- **Lists** (one dimensional object): can hold objects of any type, such as vectors, matrices, data frames, and even other lists. The elements in a list can be of different data types.
- **Data frames** (two dimensional object): can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data types.