<a href="https://colab.research.google.com/github/nmagee/ds1002/blob/main/notebooks/21-dataframes-in-r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Frames in R

To begin to understand data frames in R, let's build a simple example by hand with three vectors we want to relate in a table.

In [7]:
album <- c("The Low End Theory", "Nevermind", "Port of Morrow",
           "Dark Side of the Moon", "Naked", "OK Computer",
           "Abbey Road", "Thriller", "Rumours", "The Joshua Tree")
year <- c(1991, 1991, 2012, 1973, 1988, 1997, 1969, 1982, 1977, 1987)
digital <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

df <- data.frame(Album = album, Year = year, Digital = digital)

In [8]:
# display the df
df

Album,Year,Digital
<chr>,<dbl>,<lgl>
The Low End Theory,1991,False
Nevermind,1991,False
Port of Morrow,2012,True
Dark Side of the Moon,1973,False
Naked,1988,True
OK Computer,1997,True
Abbey Road,1969,False
Thriller,1982,False
Rumours,1977,False
The Joshua Tree,1987,True
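As a quick sanity check after building a data frame by hand, it can help to confirm its shape. The sketch below uses a shortened, three-row copy of the table (the name `df_small` is only for illustration) and the base R functions `nrow()`, `ncol()`, `dim()`, and `names()`. Note that `data.frame()` expects the vectors to have the same length.

```r
# A shortened copy of the album table (first three rows only).
album <- c("The Low End Theory", "Nevermind", "Port of Morrow")
year <- c(1991, 1991, 2012)
digital <- c(FALSE, FALSE, TRUE)
df_small <- data.frame(Album = album, Year = year, Digital = digital)

n_rows <- nrow(df_small)       # number of observations (rows)
n_cols <- ncol(df_small)       # number of variables (columns)
col_names <- names(df_small)   # column names, in order

print(dim(df_small))           # rows first, then columns
print(col_names)
```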


## Column Naming

We can also name the columns of a data frame, whether we imported it or built it by hand:

In [None]:
df <- data.frame(Album = album, Year = year, Digital = digital)

In [None]:
names(df) <- c("Album", "Year", "Digital")
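Assigning to `names()` replaces every column name at once. To rename just one column, one option is to index into `names()` first. This is a minimal sketch on a two-row illustrative table; the new name `Released` is a made-up example, not part of the original data.

```r
# Small illustrative table.
df_small <- data.frame(Album = c("Nevermind", "Naked"),
                       Year = c(1991, 1988))

# Rename only the column currently called "Year".
names(df_small)[names(df_small) == "Year"] <- "Released"

print(names(df_small))
```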

## Insert a Column

Data can always be added to or removed from a data frame after its creation. Access an existing column with `$`, and assign to a new column name to insert it.

In [9]:
df$Year

In [10]:
df$Artist <- c("A Tribe Called Quest", "Nirvana", "The Shins", "Pink Floyd", "Talking Heads", "Radiohead", "The Beatles", "Michael Jackson", "Fleetwood Mac", "U2")

In [11]:
df

Album,Year,Digital,Artist
<chr>,<dbl>,<lgl>,<chr>
The Low End Theory,1991,False,A Tribe Called Quest
Nevermind,1991,False,Nirvana
Port of Morrow,2012,True,The Shins
Dark Side of the Moon,1973,False,Pink Floyd
Naked,1988,True,Talking Heads
OK Computer,1997,True,Radiohead
Abbey Road,1969,False,The Beatles
Thriller,1982,False,Michael Jackson
Rumours,1977,False,Fleetwood Mac
The Joshua Tree,1987,True,U2
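Removing a column works much like adding one. A common base R idiom is to assign `NULL` to the column. The sketch below uses a small illustrative table (`df_small` is a stand-in name) rather than the full `df`.

```r
# Small illustrative table with three columns.
df_small <- data.frame(Album = c("Nevermind", "Naked"),
                       Year = c(1991, 1988),
                       Digital = c(FALSE, TRUE))

# Assigning NULL to a column removes it from the data frame.
df_small$Digital <- NULL

print(names(df_small))
```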


## Structure

To fetch the structure of a data frame, use `str()`. This tells us how many observations (rows) it contains along with how many variables (columns) each observation contains.

The output looks much like a list. This shouldn't be surprising, since a data frame is a list of equal-length columns and can therefore hold mixed data types. However, all values within a single column must be of the same data type. (Album and Artist are all strings, Year is all numbers, and Digital is all logicals.)

In [12]:
str(df)

'data.frame':	10 obs. of  4 variables:
 $ Album  : chr  "The Low End Theory" "Nevermind" "Port of Morrow" "Dark Side of the Moon" ...
 $ Year   : num  1991 1991 2012 1973 1988 ...
 $ Digital: logi  FALSE FALSE TRUE FALSE TRUE TRUE ...
 $ Artist : chr  "A Tribe Called Quest" "Nirvana" "The Shins" "Pink Floyd" ...
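The list-like nature of a data frame can be checked directly. This sketch, on a small illustrative table, uses `is.list()` and applies `class()` to each column with `sapply()`. `stringsAsFactors = FALSE` is passed explicitly so the result does not depend on the R version.

```r
# Small illustrative table with one column of each type.
df_small <- data.frame(Album = c("Nevermind", "Naked"),
                       Year = c(1991, 1988),
                       Digital = c(FALSE, TRUE),
                       stringsAsFactors = FALSE)

is_a_list <- is.list(df_small)        # a data frame is a list of columns
col_types <- sapply(df_small, class)  # one class per column

print(is_a_list)
print(col_types)
```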


## Selecting Columns

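Indicate the indexes of the columns you want in the second half of the slice brackets `[ ]`, after the comma. This can be a range separated by a colon `:`, a vector built with `c()`, or a negative index to exclude columns. Leaving the first half empty returns all rows.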

In [None]:
# General pattern (placeholders, not runnable as written):
# df[ROWS, COLUMNS]

In [13]:
df[ ,1:2]

Album,Year
<chr>,<dbl>
The Low End Theory,1991
Nevermind,1991
Port of Morrow,2012
Dark Side of the Moon,1973
Naked,1988
OK Computer,1997
Abbey Road,1969
Thriller,1982
Rumours,1977
The Joshua Tree,1987


In [14]:
df[1:5,]

,Album,Year,Digital,Artist
,<chr>,<dbl>,<lgl>,<chr>
1,The Low End Theory,1991,False,A Tribe Called Quest
2,Nevermind,1991,False,Nirvana
3,Port of Morrow,2012,True,The Shins
4,Dark Side of the Moon,1973,False,Pink Floyd
5,Naked,1988,True,Talking Heads


In [15]:
df[,]

Album,Year,Digital,Artist
<chr>,<dbl>,<lgl>,<chr>
The Low End Theory,1991,False,A Tribe Called Quest
Nevermind,1991,False,Nirvana
Port of Morrow,2012,True,The Shins
Dark Side of the Moon,1973,False,Pink Floyd
Naked,1988,True,Talking Heads
OK Computer,1997,True,Radiohead
Abbey Road,1969,False,The Beatles
Thriller,1982,False,Michael Jackson
Rumours,1977,False,Fleetwood Mac
The Joshua Tree,1987,True,U2


In [20]:
df[ , c(1,4)]

Album,Artist
<chr>,<chr>
The Low End Theory,A Tribe Called Quest
Nevermind,Nirvana
Port of Morrow,The Shins
Dark Side of the Moon,Pink Floyd
Naked,Talking Heads
OK Computer,Radiohead
Abbey Road,The Beatles
Thriller,Michael Jackson
Rumours,Fleetwood Mac
The Joshua Tree,U2


In [17]:
df[ 6, 4]

In [19]:
df[  , -(2:3) ]

Album,Artist
<chr>,<chr>
The Low End Theory,A Tribe Called Quest
Nevermind,Nirvana
Port of Morrow,The Shins
Dark Side of the Moon,Pink Floyd
Naked,Talking Heads
OK Computer,Radiohead
Abbey Road,The Beatles
Thriller,Michael Jackson
Rumours,Fleetwood Mac
The Joshua Tree,U2
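One detail worth knowing when selecting with `[ ]`: asking for a single column returns a plain vector by default, not a data frame. Passing `drop = FALSE` keeps the data frame structure. The following is a sketch on a small illustrative table.

```r
# Small illustrative table.
df_small <- data.frame(Album = c("Nevermind", "Naked"),
                       Year = c(1991, 1988),
                       stringsAsFactors = FALSE)

one_col_vec <- df_small[, "Album"]               # a character vector
one_col_df <- df_small[, "Album", drop = FALSE]  # still a data frame
single_value <- df_small[2, 1]                   # row 2, column 1

print(class(one_col_vec))
print(class(one_col_df))
print(single_value)
```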


## Slicing & Filtering

Indicate the indexes of the rows you want in the first half of the slice brackets `[ ]`, followed by a comma. Leaving the column specification after the comma empty returns every column.

For example, `df[1:3,]` asks for all columns of records 1-3.

In [21]:
df

Album,Year,Digital,Artist
<chr>,<dbl>,<lgl>,<chr>
The Low End Theory,1991,False,A Tribe Called Quest
Nevermind,1991,False,Nirvana
Port of Morrow,2012,True,The Shins
Dark Side of the Moon,1973,False,Pink Floyd
Naked,1988,True,Talking Heads
OK Computer,1997,True,Radiohead
Abbey Road,1969,False,The Beatles
Thriller,1982,False,Michael Jackson
Rumours,1977,False,Fleetwood Mac
The Joshua Tree,1987,True,U2


In [23]:
df[ c(1,9,4,5) , c(4,2)]

,Artist,Year
,<chr>,<dbl>
1,A Tribe Called Quest,1991
9,Fleetwood Mac,1977
4,Pink Floyd,1973
5,Talking Heads,1988


In [None]:
df[1:3,]

This example asks for records 1-3 and columns 1-2:

In [None]:
df[1:3,1:2]

Select specific rows by combining their indexes with `c()` in the first half of the slice brackets `[ ]`. Columns can be selected by name in the same way.

In [None]:
df[c(1,3,4),]

In [25]:
df[ , c("Album","Artist")]

Album,Artist
<chr>,<chr>
The Low End Theory,A Tribe Called Quest
Nevermind,Nirvana
Port of Morrow,The Shins
Dark Side of the Moon,Pink Floyd
Naked,Talking Heads
OK Computer,Radiohead
Abbey Road,The Beatles
Thriller,Michael Jackson
Rumours,Fleetwood Mac
The Joshua Tree,U2


Select specific rows based on a filter. Here we **query** the data frame for all albums with a `FALSE` value in the `Digital` column.

In [24]:
df

Album,Year,Digital,Artist
<chr>,<dbl>,<lgl>,<chr>
The Low End Theory,1991,False,A Tribe Called Quest
Nevermind,1991,False,Nirvana
Port of Morrow,2012,True,The Shins
Dark Side of the Moon,1973,False,Pink Floyd
Naked,1988,True,Talking Heads
OK Computer,1997,True,Radiohead
Abbey Road,1969,False,The Beatles
Thriller,1982,False,Michael Jackson
Rumours,1977,False,Fleetwood Mac
The Joshua Tree,1987,True,U2


In [None]:
df[df$Digital == FALSE,]

In [28]:
df[df$Digital != FALSE , ]

,Album,Year,Digital,Artist
,<chr>,<dbl>,<lgl>,<chr>
3,Port of Morrow,2012,True,The Shins
5,Naked,1988,True,Talking Heads
6,OK Computer,1997,True,Radiohead
10,The Joshua Tree,1987,True,U2


In [None]:
# The same kind of filter, this time for albums where Digital is TRUE:
df[df$Digital == TRUE,]

In [29]:
# Or using mathematical operators to filter:

df[df$Year > 1990,]

,Album,Year,Digital,Artist
,<chr>,<dbl>,<lgl>,<chr>
1,The Low End Theory,1991,False,A Tribe Called Quest
2,Nevermind,1991,False,Nirvana
3,Port of Morrow,2012,True,The Shins
6,OK Computer,1997,True,Radiohead


In [32]:
# Or combine operators to filter more carefully. All comparison operators can be used
# ( ==, !=, <, >, <=, >= )
#
# as well as all logical operators
#   - AND: &
#   -  OR: |
#   - NOT: !

df[df$Year > 1990 & df$Digital == TRUE,]

,Album,Year,Digital,Artist
,<chr>,<dbl>,<lgl>,<chr>
3,Port of Morrow,2012,True,The Shins
6,OK Computer,1997,True,Radiohead


In [34]:
df[df$Year > 1990 | df$Digital == FALSE,]

,Album,Year,Digital,Artist
,<chr>,<dbl>,<lgl>,<chr>
1,The Low End Theory,1991,False,A Tribe Called Quest
2,Nevermind,1991,False,Nirvana
3,Port of Morrow,2012,True,The Shins
4,Dark Side of the Moon,1973,False,Pink Floyd
6,OK Computer,1997,True,Radiohead
7,Abbey Road,1969,False,The Beatles
8,Thriller,1982,False,Michael Jackson
9,Rumours,1977,False,Fleetwood Mac
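It may help to see what the filter expression produces on its own. A comparison such as `df$Year > 1990` returns a logical vector with one `TRUE`/`FALSE` per row, and `[ ]` keeps the rows marked `TRUE`. The sketch below uses a three-row illustrative table and also shows the NOT operator `!`, which is listed above but not demonstrated.

```r
# Small illustrative table.
df_small <- data.frame(Album = c("Nevermind", "Naked", "OK Computer"),
                       Year = c(1991, 1988, 1997),
                       Digital = c(FALSE, TRUE, TRUE),
                       stringsAsFactors = FALSE)

mask <- df_small$Year > 1990   # one logical value per row
matching_rows <- which(mask)   # positions of the TRUE values

# NOT: keep rows where Digital is FALSE.
not_digital <- df_small[!df_small$Digital, ]

print(mask)
print(matching_rows)
print(not_digital)
```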


## Sorting

In [None]:
df_by_year <- df[order(df$Year),]
df_by_year
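The sorting cell above relies on `order()`, which returns the row positions in sorted order rather than the sorted values; using those positions as the row index rearranges the data frame. As a sketch on a four-row illustrative table, the example below also sorts newest-first by negating the numeric column, with `Album` as a tie-breaker.

```r
# Small illustrative table with a tie in Year.
df_small <- data.frame(
  Album = c("The Low End Theory", "Nevermind",
            "Dark Side of the Moon", "Port of Morrow"),
  Year = c(1991, 1991, 1973, 2012),
  stringsAsFactors = FALSE
)

# order() gives the row positions from smallest to largest Year.
idx <- order(df_small$Year)

# Descending by Year; ties are broken alphabetically by Album.
by_year_desc <- df_small[order(-df_small$Year, df_small$Album), ]

print(idx)
print(by_year_desc)
```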

## `tidyverse`

All of the above operations -- selecting, querying, filtering, sorting -- are also possible (often more readably) using functions from the `tidyverse` collection of packages.

Import that and then we will review those operations.

```
Understand the PIPE in tidyverse: %>%
```

In [36]:
install.packages("tidyverse")
library(tidyverse)

In [None]:
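# %>% is the pipe operator: it passes the result on its left
# to the function on its right as that function's first argument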

In [37]:
df %>%
  select(Album, Year) %>%
  filter(Year > 1990)

Album,Year
<chr>,<dbl>
The Low End Theory,1991
Nevermind,1991
Port of Morrow,2012
OK Computer,1997


## Import CSV Data

Just as with Pandas in Python, it is much more common to load data from files, such as CSV, than to build data frames by hand. This is possible manually (with base R), as well as by using the `tidyverse` library.

### Import a CSV Manually

R has a native `read.csv()` function. A few things to note as you run this cell:

- This method automatically creates a data frame from the CSV data.
- The file can be local or remote via URL.
- Since the data file has a header row, names are automatically assigned to columns.

In [None]:
titanic <- read.csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
titanic
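The points above can be tried without a network connection. As a sketch, `read.csv()` also accepts CSV content through its `text` argument, so a tiny made-up CSV string (not the Titanic data) is enough to see the header row become column names and the column types get inferred.

```r
# A tiny CSV held in a string; the first line is the header row.
csv_text <- "Album,Year,Digital
Nevermind,1991,FALSE
Naked,1988,TRUE"

# read.csv() builds a data frame and takes column names from the header.
albums <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(albums)
```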

We can now use the same filtering functions as above to inspect the data. The first cell below displays the result of the query; the second saves it as a new data frame, `df2`.

In [None]:
titanic %>%
  select(Pclass, Name, Sex, Fare, Age) %>%
  filter(Sex == "female") %>%
  filter(Pclass == 3) %>%
  filter(Age < 20) %>%
  arrange(Age)

In [None]:
df2 <- titanic %>%
  select(Pclass, Name, Sex, Fare, Age) %>%
  filter(Sex == "female") %>%
  filter(Pclass == 3) %>%
  filter(Age < 20) %>%
  arrange(Age)

Or we can explore one of the sample data sets that load with the `tidyverse`. In this case let's use the `starwars` data set, which comes from the `dplyr` package.

In [41]:
# list the data sets available in the attached packages
data()

In [None]:
str(starwars)

In [None]:
head(starwars)

In [46]:
starwars %>%   # and-then
  select(gender, mass, height, species) %>%
  filter(species == "Human") %>%
  na.omit()

gender,mass,height,species
<chr>,<dbl>,<int>,<chr>
masculine,77.0,172,Human
masculine,136.0,202,Human
feminine,49.0,150,Human
masculine,120.0,178,Human
feminine,75.0,165,Human
masculine,84.0,183,Human
masculine,77.0,182,Human
masculine,84.0,188,Human
masculine,80.0,180,Human
masculine,77.0,170,Human


In [49]:
starwars %>%   # and-then
  select(gender, mass, height, species) %>%
  filter(species == "Human") %>%
  na.omit() %>%
  mutate(height = height / 100) %>%
  mutate(BMI = mass / height^2) %>%
  group_by(gender) %>%
  summarize(Average_BMI = mean(BMI))

gender,Average_BMI
<chr>,<dbl>
feminine,21.95164
masculine,26.04427
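The two `mutate()` steps convert height from centimeters to meters and then compute BMI as mass divided by height squared. As a rough base R sketch of the same arithmetic, the example below uses the first three human rows shown in the output above; the names `people` and `height_m` are illustrative only.

```r
# First three human rows from the output above (mass in kg, height in cm).
people <- data.frame(gender = c("masculine", "masculine", "feminine"),
                     mass = c(77, 136, 49),
                     height = c(172, 202, 150),
                     stringsAsFactors = FALSE)

people$height_m <- people$height / 100         # centimeters to meters
people$BMI <- people$mass / people$height_m^2  # kg divided by m squared

# Mean BMI per gender, similar in spirit to group_by() + summarize().
avg_bmi <- tapply(people$BMI, people$gender, mean)

print(people)
print(avg_bmi)
```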


Finally, we can explore the `msleep` (Mammal Sleep) sample data set, which comes from the `ggplot2` package, using `tidyverse`.

In [53]:
my_data <- msleep %>%
  select(name, order, bodywt, sleep_total) %>%
  filter(order == "Primates", bodywt > 20) %>%
  arrange(bodywt)

my_data

name,order,bodywt,sleep_total
<chr>,<chr>,<dbl>,<dbl>
Baboon,Primates,25.235,9.4
Chimpanzee,Primates,52.2,9.7
Human,Primates,62.0,8.0


## Extract a **Column** into a Vector

To extract a column back into a vector, append `$ColName` to the name of the data frame.

In [52]:
# assign into a var
albums_extracted <- df$Album
albums_extracted
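Besides `$`, base R offers double brackets `[[ ]]` for the same extraction, which is handy when the column name is stored in a variable. This sketch uses a small illustrative table; `col_name` is a hypothetical variable introduced for the example.

```r
# Small illustrative table.
df_small <- data.frame(Album = c("Nevermind", "Naked"),
                       Year = c(1991, 1988),
                       stringsAsFactors = FALSE)

by_dollar <- df_small$Album         # extract with $
col_name <- "Album"                 # column name held in a variable
by_brackets <- df_small[[col_name]] # extract with [[ ]]

print(by_dollar)
print(identical(by_dollar, by_brackets))
```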