<a href="https://colab.research.google.com/github/stephenfrein/csc8491/blob/main/R_Data_Structures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# R Data Structures #

We'll be working with a number of data structures that are fundamental to R programming and analysis.

**Vector** - ordered set of values that all share the same type

In [2]:
# c is combine function - creates a vector
names = c("Steve","John","Rita")
names
typeof(names)
# single quotes are equivalent, but R style prefers double quotes
# names = c('Steve','John','Rita')

In [3]:
# numeric
nums = c(100,200,300)
nums
typeof(nums)

# boolean (logical)
bools = c(TRUE,FALSE,TRUE)
bools
typeof(bools)

In [4]:
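# access specific elements with one-based indexing
nums[3]
# assigning one past the end grows the vector
nums[4] <- 400
nums
typeof(nums)
# assigning a string coerces every element to character
nums[5] <- "ABC"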
nums
typeof(nums)
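
The type change in the cell above is R's implicit coercion at work: a vector can hold only one type, so mixed values are converted to the most flexible type present. Here is a minimal sketch of that behavior; the variable names (`num_and_lgl`, `num_and_chr`, `back_to_num`) are just illustrative.

```r
# Mixing logicals and numbers: logicals become 1 and 0
num_and_lgl <- c(1, TRUE, FALSE)
typeof(num_and_lgl)    # "double"

# Mixing numbers and strings: numbers become strings
num_and_chr <- c(100, 200, "ABC")
typeof(num_and_chr)    # "character"
num_and_chr

# Converting back: anything that is not a number becomes NA.
# as.numeric() warns about this, so the warning is suppressed here.
back_to_num <- suppressWarnings(as.numeric(num_and_chr))
back_to_num
```

If a numeric column unexpectedly shows up as character, a stray string value like this is a common cause.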

In [7]:
#ranges
names
names[2:3]
#exclude with minus
names[-3]
#logical vector for inclusions
bools
names[bools]
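
In practice the logical vector is usually computed from a comparison rather than typed by hand. The sketch below rebuilds the `nums` and `names` vectors from earlier so it can run on its own; `is_big`, `big_nums`, and `big_names` are illustrative names.

```r
nums <- c(100, 200, 300)
names <- c("Steve", "John", "Rita")

# A comparison produces a logical vector the same length as nums
is_big <- nums > 150
is_big

# Use it to filter the vector it came from...
big_nums <- nums[is_big]
big_nums

# ...or a parallel vector of the same length
big_names <- names[is_big]
big_names

# which() reports the positions of the TRUE values
which(is_big)
```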




**Factor** - vector used for nominal (category) variables

*   suited to variables that take a limited number of distinct values
*   stores category labels only once – basically a small lookup table
*   classification algorithms will often expect categorical target variables to be represented as factors






In [17]:
# . is just another character in R names
colors.vector <- c("Red","Blue","Green")
colors.factor <- factor(c("Red","Blue","Green"))
print('Vector below: ')
colors.vector
print('Factor below: ')
colors.factor
print('Vector selection below: ')
colors.vector[2]
print('Factor selection below: ')
colors.factor[2]

[1] "Vector below: "


[1] "Factor below: "


[1] "Vector selection below: "


[1] "Factor selection below: "
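
The "small lookup table" idea is easier to see by looking inside a factor. This is a small sketch, with one color repeated so the counts are more interesting; `color_levels`, `color_codes`, and `color_counts` are illustrative names.

```r
colors.factor <- factor(c("Red", "Blue", "Green", "Blue"))

# The distinct labels are stored once, sorted alphabetically by default
color_levels <- levels(colors.factor)
color_levels

# Each element is stored as an integer code pointing into the levels
color_codes <- as.integer(colors.factor)
color_codes

# table() counts how many times each level occurs
color_counts <- table(colors.factor)
color_counts
```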


**List** - doesn't require all elements to be of the same type

* can give names to elements of list and reference the names

In [21]:
list1 <- list("Steve",123, TRUE)
list2 <- list(firstname="Steve",id=123,active=TRUE)
#access specific elements with one-based indexing
print("indexing is one-based")
list2[2]
#or with name-based indexing
print("or it can be name-based")
list2$firstname
#or with a name vector
print("it can use a name vector too")
list2[c("firstname","active")]

[1] "indexing is one-based"


[1] "or it can be name-based"


[1] "it can use a name vector too"
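
One detail worth seeing in code: single brackets on a list return a smaller list, while double brackets (or `$`) return the element itself. A minimal sketch, reusing `list2` from above; `as_sublist`, `as_value`, and `next_id` are illustrative names.

```r
list2 <- list(firstname = "Steve", id = 123, active = TRUE)

# Single brackets return a list containing the selected element
as_sublist <- list2[2]
class(as_sublist)    # "list"

# Double brackets return the element itself
as_value <- list2[[2]]
class(as_value)      # "numeric"

# Arithmetic needs the element itself, not a one-element list
next_id <- list2[["id"]] + 1
next_id
```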


In [None]:
### YOU TRY -  how would you pick some elements of the list using a logical vector? ###
### try it in this cell ###

**Data Frame** - tabular structure with rows and columns

* a list of vectors
* you'll use these often - most model-building libraries will expect your data in this form

In [25]:
name <- c("Steve","John","Rita")
id <- c(100,200,300)
active <- c(TRUE,FALSE,TRUE)
gender <- c("M","M","F")
#create a data frame - basically a list of same-length vectors
family <- data.frame(name, id, active, gender)
print("Show the data frame")
family
print("What's its structure?")
# str works on all kinds of variables in R
str(family)

[1] "Show the data frame"


name,id,active,gender
<chr>,<dbl>,<lgl>,<chr>
Steve,100,True,M
John,200,False,M
Rita,300,True,F


[1] "What's its structure?"
'data.frame':	3 obs. of  4 variables:
 $ name  : chr  "Steve" "John" "Rita"
 $ id    : num  100 200 300
 $ active: logi  TRUE FALSE TRUE
 $ gender: chr  "M" "M" "F"
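
Because a data frame is a list of same-length vectors, the usual list tools work on it. The sketch below rebuilds `family` so it runs on its own; `is_list`, `col_names`, `n_cols`, `n_rows`, and `ids` are illustrative names.

```r
family <- data.frame(
  name   = c("Steve", "John", "Rita"),
  id     = c(100, 200, 300),
  active = c(TRUE, FALSE, TRUE),
  gender = c("M", "M", "F")
)

# A data frame really is a list underneath
is_list <- is.list(family)
is_list

# List functions operate on the columns
col_names <- names(family)
n_cols <- length(family)   # number of columns, not rows
n_rows <- nrow(family)

# Double brackets pull out one column as a plain vector
ids <- family[["id"]]
ids
```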


In [26]:
#elements in this list have names
print("Pick one column")
family$name

[1] "Pick one column"


In [28]:
family
#select positionally using [rows, cols] notation
print("Positional selections")
family[1,2]
family[2,1]

name,id,active,gender
<chr>,<dbl>,<lgl>,<chr>
Steve,100,True,M
John,200,False,M
Rita,300,True,F


[1] "Positional selections"


In [31]:
print("Here's a row")
family[2,] #row
print("Here's a column")
family[,4] #column

[1] "Here's a row"


,name,id,active,gender
,<chr>,<dbl>,<lgl>,<chr>
2,John,200,False,M


[1] "Here's a column"
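
Selecting a single column with `[ , n]` returns a plain vector rather than a data frame. When a one-column data frame is wanted instead, `drop = FALSE` keeps the tabular shape. A small sketch, rebuilding `family`; `stringsAsFactors = FALSE` is passed explicitly so the result is the same on older R versions, and `gender_vec` and `gender_df` are illustrative names.

```r
family <- data.frame(
  name   = c("Steve", "John", "Rita"),
  id     = c(100, 200, 300),
  active = c(TRUE, FALSE, TRUE),
  gender = c("M", "M", "F"),
  stringsAsFactors = FALSE
)

# Default: a single column is simplified to a vector
gender_vec <- family[, 4]
class(gender_vec)    # "character"

# drop = FALSE keeps it as a one-column data frame
gender_df <- family[, 4, drop = FALSE]
class(gender_df)     # "data.frame"
```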


In [32]:
family
print("Grab a slice on two axes")
family[1:2,2:3] #subset
print("Can skip around")
family[c(1,3),c(2,4)] #skip row/column

name,id,active,gender
<chr>,<dbl>,<lgl>,<chr>
Steve,100,True,M
John,200,False,M
Rita,300,True,F


[1] "Grab a slice on two axes"


,id,active
,<dbl>,<lgl>
1,100,True
2,200,False


[1] "Can skip around"


,id,gender
,<dbl>,<chr>
1,100,M
3,300,F


In [34]:
family
print("Remove a row and select by column name")
family[-3,c("name","gender")] #remove & select by name

name,id,active,gender
<chr>,<dbl>,<lgl>,<chr>
Steve,100,True,M
John,200,False,M
Rita,300,True,F


[1] "Remove a row and select by column name"


,name,gender
,<chr>,<chr>
1,Steve,M
2,John,M


Updating Data Frame Contents

In [36]:
# using famous "iris" data set - it just lives forever in many data science environments for demonstration purposes
iris

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


In [37]:
#update all values in column
iris_copy <- iris
mean(iris_copy$Petal.Width)
iris_copy$Petal.Width <- 2
mean(iris_copy$Petal.Width)
head(iris_copy)

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
1,5.1,3.5,1.4,2,setosa
2,4.9,3.0,1.4,2,setosa
3,4.7,3.2,1.3,2,setosa
4,4.6,3.1,1.5,2,setosa
5,5.0,3.6,1.4,2,setosa
6,5.4,3.9,1.7,2,setosa


In [38]:
#update select values
iris_copy <- iris
print("mean and range before update...")
mean(iris_copy$Petal.Width)
range(iris_copy$Petal.Width)
iris_copy$Petal.Width[iris_copy$Petal.Width > 2] <- 2
print("mean and range after update...")
mean(iris_copy$Petal.Width)
range(iris_copy$Petal.Width)


[1] "mean and range before update..."


[1] "mean and range after update..."
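
Capping values at a ceiling, as in the cell above, can also be written with `pmin()`, which takes the element-wise minimum. This sketch compares the two approaches on copies of `iris`; it is one alternative, not necessarily the preferred one, and `by_index`, `by_pmin`, and `same_result` are illustrative names.

```r
# Approach from the cell above: assign into the matching elements
by_index <- iris
by_index$Petal.Width[by_index$Petal.Width > 2] <- 2

# Alternative: element-wise minimum of each value and the cap
by_pmin <- iris
by_pmin$Petal.Width <- pmin(by_pmin$Petal.Width, 2)

# Both approaches give the same column
same_result <- identical(by_index$Petal.Width, by_pmin$Petal.Width)
same_result

max(by_pmin$Petal.Width)
```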


Creating subsets of data frames

In [None]:
# subset data based on logical values
# notice == for equality
setosa_iris <- subset(iris, Species == "setosa")
head(setosa_iris)
wide_sepals <- subset(iris, Sepal.Width > 4,
                  select=c(Species, Sepal.Width))
head(wide_sepals)
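
`subset()` is a convenience wrapper, and the same selection can be written with the `[rows, cols]` notation from earlier. A short sketch comparing the two forms; `wide_sepals_idx` is an illustrative name.

```r
# subset(): column names can be used bare, without iris$
wide_sepals <- subset(iris, Sepal.Width > 4,
                      select = c(Species, Sepal.Width))

# Bracket form: logical vector for rows, name vector for columns
wide_sepals_idx <- iris[iris$Sepal.Width > 4, c("Species", "Sepal.Width")]

# The two results should match
all.equal(wide_sepals, wide_sepals_idx)
nrow(wide_sepals_idx)
```

One difference to keep in mind: `subset()` drops rows where the condition is `NA`, while bracket indexing keeps them as all-`NA` rows. `iris` has no missing values, so the results agree here.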



Applying SQL to data frames
* only does SELECTs, but supports almost any kind of SELECT you want

In [40]:
install.packages("sqldf")
library(sqldf)
sqldf('select [Sepal.Width], Species from iris where [Sepal.Width] > 4')


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘plogr’, ‘gsubfn’, ‘proto’, ‘RSQLite’, ‘chron’


Loading required package: gsubfn

Loading required package: proto

“no DISPLAY variable so Tk is not available”
Loading required package: RSQLite



Sepal.Width,Species
<dbl>,<fct>
4.4,setosa
4.1,setosa
4.2,setosa
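
As an example of a richer SELECT, the sketch below groups and aggregates in SQL. It assumes the `sqldf` package is already installed, as in the cell above; the result name `species_summary` and the column aliases `n_flowers` and `mean_petal_length` are illustrative.

```r
library(sqldf)

# Count rows and average petal length per species.
# Column names containing a dot are wrapped in [ ] as in the query above.
species_summary <- sqldf('
  select Species,
         count(*) as n_flowers,
         avg([Petal.Length]) as mean_petal_length
  from iris
  group by Species
')
species_summary
```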
