## LECTURE 4: DATA STRUCTURES IN R  (contd)

### STAT598z: Intro. to computing for statistics


***




### Vinayak Rao

#### Department of Statistics, Purdue University

In [None]:
options(repr.plot.width=3, repr.plot.height=3)

## Data frames

Very common and convenient data structures

Used to store tables:
+ Columns are variables and rows are observations


|       | Age | PhD   | GPA |
| ----- |:---:|:-----:| --- |
| Alice | 25  | TRUE  | 3.6 |
| Bob   | 24  | TRUE  | 3.4 |
| Carol | 21  | FALSE | 3.8 |

An R data frame is a list of equal-length vectors, one per column

In [6]:
df <- data.frame(age = c(25L, 24L, 21L),      # Warning: df is also the name
                 PhD = c(TRUE, TRUE, FALSE),  #   of an R function (F density)
                 GPA = c(3.6, 2.4, 2.8))

In [7]:
print(df)

  age   PhD GPA
1  25  TRUE 3.6
2  24  TRUE 2.4
3  21 FALSE 2.8


In [3]:
typeof(df)

In [4]:
class(df)

Since data frames are lists, we can use list indexing

Can also use matrix indexing (more convenient)

In [9]:
print(df[2,'age']) 

[1] 24


In [10]:
print(df[2,])

  age  PhD GPA
2  24 TRUE 2.4


In [11]:
print(df$GPA)

[1] 3.6 2.4 2.8
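The cells above use matrix-style and `$` indexing. As a minimal sketch of the list-style indexing mentioned earlier, the block below rebuilds the same table under the hypothetical name `students` (chosen here only to avoid reusing `df`) and pulls out columns the way one would pull elements out of a list.

```r
# Same data as the lecture's df, under a hypothetical name
students <- data.frame(age = c(25L, 24L, 21L),
                       PhD = c(TRUE, TRUE, FALSE),
                       GPA = c(3.6, 2.4, 2.8))

is.list(students)              # a data frame is a list of columns

gpa_vec <- students[["GPA"]]   # [[ ]] extracts one column as a plain vector
gpa_df  <- students["GPA"]     # [ ] returns a one-column data frame
age_vec <- students[[1]]       # list elements can also be picked by position

print(gpa_vec)
print(class(gpa_df))
```

The difference between `[[ ]]` and `[ ]` is the same as for ordinary lists: the first returns the element, the second returns a shorter list (here, a smaller data frame).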


In [14]:
nrow(df)*ncol(df)

List functions apply as usual

Matrix functions are also interpreted intuitively

Useful functions are:
+ `length()`, `dim()`, `nrow()`, `ncol()`
+ `names()` (or `colnames()`), `rownames()`
+ `rbind()`, `cbind()`
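A short sketch of how these functions behave on a small data frame. The name `students`, the extra row, and the `year` column are made up for illustration and are not part of the lecture's data.

```r
students <- data.frame(age = c(25L, 24L, 21L),
                       PhD = c(TRUE, TRUE, FALSE),
                       GPA = c(3.6, 2.4, 2.8))

n_cols    <- length(students)   # list view: number of columns
shape     <- dim(students)      # matrix view: c(rows, columns)
col_names <- names(students)    # same result as colnames(students)

# rbind() appends observations; the new row must have matching columns
new_row   <- data.frame(age = 30L, PhD = FALSE, GPA = 3.1)  # hypothetical values
more_rows <- rbind(students, new_row)

# cbind() appends variables; the new column needs one value per row
more_cols <- cbind(students, year = c(3L, 2L, 1L))          # hypothetical column

print(dim(more_rows))
print(names(more_cols))
```

Note that `length()` counts columns, not cells, because it treats the data frame as a list.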

In [15]:
rownames(df) <- c("Alice", "Bob", "Carol")

In [18]:
df[4,1] <- 30L; print(df)

      age   PhD GPA
Alice  25  TRUE 3.6
Bob    24  TRUE 2.4
Carol  21 FALSE 2.8
4      30    NA  NA


Many R datasets are data frames 

In [None]:
library("datasets")
class(mtcars)

In [None]:
print(head(mtcars)) # Print part of a large object

## Factors
Categorical variables that take on a finite number of values

+ **Employee type**: `student/staff/faculty`
+ **Grade**: `A/B/C/F`

Useful when a variable can take only a fixed set of values
(unlike character strings, which can hold anything)

R implements these internally as integer vectors

A factor has two attributes that distinguish it from a regular integer vector:

`levels` specifies the possible values the factor can take
+ E.g. `c("male", "female")`

`class = "factor"` tells R to check for violations (values that are not among the levels)
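To make the integer-plus-attributes picture concrete, here is a small sketch using a hypothetical factor `sex` built from the example levels above.

```r
# Hypothetical three-element factor with the two levels from the example
sex <- factor(c("male", "female", "female"),
              levels = c("male", "female"))

typeof(sex)              # the underlying storage type
attributes(sex)          # the levels and class attributes
codes <- as.integer(sex) # integer codes: positions within levels(sex)
print(codes)
print(unclass(sex))      # dropping the class shows the raw integer vector
```

Each integer code is an index into `levels(sex)`, which is why the order given in `levels =` matters.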

In [3]:
# Character vector for 4 students
grades_bad <- c("a", "a", "b", "f") 

In [1]:
# Factor vector for 4 students
grades <- factor(c("a", "a", "b", "f"))

In [11]:
print(grades)

[1] a a b f
Levels: a b f


In [6]:
typeof(grades)

In [7]:
class(grades)

In [12]:
levels(grades) # Not quite what we wanted: "c" is missing!

In [13]:
grades <- factor(c("a", "a", "b", "f"))
str(grades)

 Factor w/ 3 levels "a","b","f": 1 1 2 3


In [14]:
grades[2] <- "c"

“invalid factor level, NA generated”

In [16]:
str(grades)

 Factor w/ 3 levels "a","b","f": 1 NA 2 3


In [17]:
grades <- factor(c("a","a","b","a","f"),
             levels = c("a","b","c","f"))

In [18]:
str(grades)

 Factor w/ 4 levels "a","b","c","f": 1 1 2 1 4


In [19]:
table(grades)   # table also works with other data-types

grades
a b c f 
3 1 0 1 

Factors can be ordered:

In [20]:
grades <- factor(c("a","a","b","f"),
            levels = c("f","c","b","a"),
            ordered = TRUE )
grades

In [23]:
grades[1] > grades[3]
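A brief sketch of what ordering buys you, reusing the ordered `grades` factor defined above. The names `best` and `passed` are introduced here only for illustration.

```r
grades <- factor(c("a", "a", "b", "f"),
                 levels = c("f", "c", "b", "a"),
                 ordered = TRUE)

is.ordered(grades)
grades[1] > grades[3]   # compares positions in the level order: "a" vs "b"

best   <- max(grades)   # max()/min() are defined for ordered factors
passed <- grades > "f"  # a level can be given as a character string

print(best)
print(passed)
print(sort(grades))     # sorts by level order, not alphabetically
```

For an unordered factor these comparisons are not meaningful, and R returns `NA` with a warning.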

`gl()`: Generate factor levels

Usage (from the R documentation):

``` R
gl(n, k, length = n * k, labels = seq_len(n),
   ordered = FALSE )
```
Look at the examples there:

In [None]:
# First control, then treatment:
gl(2, 8, labels = c("Control", "Treat")) 

In [None]:
gl(2, 1, 20) # 20 alternating 1s and 2s

In [None]:
gl(2, 2, 20) # alternating pairs of 1s and 2s
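One way to read these calls is as shorthand for `rep()` followed by `factor()`. The sketch below checks that reading for the first and second examples; the variable names are illustrative.

```r
# gl(n, k, ...): n levels, each repeated k times in a block
by_gl  <- gl(2, 8, labels = c("Control", "Treat"))
by_rep <- factor(rep(c("Control", "Treat"), each = 8),
                 levels = c("Control", "Treat"))

same_values <- all(as.character(by_gl) == as.character(by_rep))
print(same_values)

# With k = 1 and length = 20, the pattern 1, 2 is recycled to 20 values
alternating <- gl(2, 1, 20)
print(as.integer(alternating))
```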

## An aside on assignment
From the R language definition:

`x[3 : 5] <- 13 : 15`

is as if the following had been executed

``` R
`*tmp*` <- x  # Don't use your own *tmp* variables!
x       <- "[<-"(`*tmp*`, 3 : 5, value = 13 : 15)
rm(`*tmp*`)
# ls() lists all objects in current session
```
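As a rough check of this equivalence, the sketch below performs the same update twice: once with the usual replacement syntax and once by calling the `[<-` function explicitly. The names `via_assign` and `via_call` are illustrative only.

```r
x <- 1:10

# Usual replacement syntax, applied to a copy of x
via_assign <- x
via_assign[3:5] <- 13:15

# The same update written as an explicit function call,
# applied to a fresh 1:10 so that x itself is not involved
via_call <- `[<-`(1:10, 3:5, value = 13:15)

print(via_call)
print(identical(via_assign, via_call))
```

The point is that `[<-` is an ordinary function that returns the modified vector; the assignment syntax then binds that result back to the original name.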

From the R language definition:

`names(x) <- c("a","b")`

is equivalent to

```  R
  `*tmp*` <- x
  x <- "names<-"(`*tmp*`, value = c("a","b"))
  # Note: the function being called is named names<-
  rm(`*tmp*`)
```
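The same mechanism is available for user-defined functions: any function whose name ends in `<-` and whose last argument is called `value` can be used on the left-hand side of an assignment. The function `second<-` below is a hypothetical example written for this note, not something from base R.

```r
# Hypothetical replacement function: overwrite the second element of a vector
`second<-` <- function(x, value) {
  x[2] <- value
  x            # a replacement function must return the modified object
}

v <- c(10, 20, 30)
second(v) <- 99   # evaluated as v <- `second<-`(v, value = 99)
print(v)

names(v) <- c("a", "b", "c")   # the built-in names<- follows the same pattern
print(names(v))
```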