# Week 08: The R Language

## R and Python

Matlab, R and Python are the three languages that you will often hear talked about in Data Science. 

**Python**
- General purpose language
- Python is an object-oriented, high-level programming language (create classes)
- Open source (free to use and develop)
- Great for data structures and programming in general, it has a vast collection of libraries that you can use - but they can be tricky to install. 

**R**
- Functional programming (uses functions)
- Oriented to statistical analysis and data processing in small scale. 
- It has a very huge collection of packages to do almost anything you might imagine with data (they are easy to install).
- As a general programming language it is much slower than python and the syntax is arguably not as clear.

### Python to R

There are a lot of similarities between R and Python - but this can also be a common source of error.

#### Assignment & Arithmatic Operators

Python
``` python
# Assignment
    a, b = 10,20
# Power
    a ** b
# Remainder (Modulus)
    a % b
```
R
``` R
# Assignment
    a <- 10 ; b <- 20
# Power
    a ^ b
# Remainder (Modulus)
    a %% b
```


#### Logical Operators

An Element-wise or simple logical operator will consider a enitre object whereas a Short-circuit operator stops as soon as it evaluates a test that produces a specified result. 

For short circuit "and" no tests are evaluated after the first "false". 

For short circuit "or" no tests are evaluated after the first true.

Python
``` python
# Short-circuit AND
    a and b
# Short-circuit OR
    a or b
# Element-wise AND
    a and b
# Element-wise OR
    a or b
```
R
``` R
# Short-circuit AND
    a && b
# Short-circuit OR
    a || b
# Element-wise AND
    a & b
# Element-wise OR
    a | b
```

#### Sequences

Python
``` python
# 1, 2, [...] 10
    range(1,11)
# List of numeric type
    x = [2, 3, 0, 6]
# Updating at a index
    x[i] = 5

```
R
``` R
# 1, 2, [...] 10
    seq(10)
    1:10
# List of numeric type
    x <- c(1, 2.0, 4, 5)
# Updating at a index
    x[i] <- 5
```

#### Concatinate 

```c()``` is a generic function which combines its arguments.

By default it combines its arguments to form a vector.
``` R
c(1, 2, 3, 4, 5)
```
All arguments are coerced (forced) to a common type which is the type of the returned value. 
``` R
x <- c(1 , 2, 3, "four")

> print(x)
# Output
"1", "2", "3", "four"

typeof(x)
# Output
"character"
```

All attributes (labeled values you can attach to a object) except names (labels) are removed.

Note that numbers are stored as doubles (floats) by default in R. If you want to save a number as a integer, you need to use the suffix ```L```. This is shown below.

``` R
x <- c(1L, 2L, 3L)
```

### Indexing

#### Question

Say I'm working on a problem that involves the list of numbers years
```
1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000
```
I save a list of these dates in the variable ```dates_list```.



- **Q1.** In Python, what date would I get if I ran ```dates_list[2]```

- **Q2.** What would I get in R for ```dates_list[2]```?

In [22]:
# Lists in R are simular to Dictionaries in Python


x <- c(1, 2, 3, 4, 5)

num_list <- list(one = x,
                 two = c("one", "two", "three"),
                 three = matrix(data = 1:6, nrow = 3, ncol = 3))

# NOTE: Matrix() creates a matrix (table) from a given set of values

In [23]:
print(num_list)

$one
[1] 1 2 3 4 5

$two
[1] "one"   "two"   "three"

$three
     [,1] [,2] [,3]
[1,]    1    4    1
[2,]    2    5    2
[3,]    3    6    3



### Indexing

In [24]:
# We can index a list
print(num_list["two"])

# We can access data inside a list element by combining double and single brackets. 
# By using the double brackets, the list structure is dropped.
print(class(num_list["three"]))
print(class(num_list[["three"]]))

# We can also index specific elements with $
num_list$two

$two
[1] "one"   "two"   "three"

[1] "list"
[1] "matrix"


### Dataframes

In [25]:
my_dataframe <- data.frame(title = c("Dr", "Prof", "Prof"),
                         fname = c("Sian", "Milena", "Friedrich"),
                         favenum = c(13 , 99, 144))

my_dataframe$lname <- c("Brooke", "Ttvetkova", "Geicke") 

print(my_dataframe)

  title     fname favenum     lname
1    Dr      Sian      13    Brooke
2  Prof    Milena      99 Ttvetkova
3  Prof Friedrich     144    Geicke


### Control Flow

In [26]:
x <- 5
y <- 10

# Indentation is not strictly necessary, but preferred for readability. 

# The if code block is in rounded brackets
if (x < y) {
    print("x is smaller than y!")
    # the else if code block is also in rounded brackets
}   else if (y < x) {
    print("y is smaller than x!")
}   else {
    print("no number is smaller")
} # the code in the curly brackets will be ran if the conditional statement is triggered.



[1] "x is smaller than y!"


In [27]:
# For and while loops work pretty much the same way.

chr_vec <- c("this", "is", "how", "a", "for", "loop", "works")
x <- 5

# For loop
for (txt in chr_vec){
    print(txt)
}

# While loop
while (x < 10){
    print(x)
    x <- x + 1
}

[1] "this"
[1] "is"
[1] "how"
[1] "a"
[1] "for"
[1] "loop"
[1] "works"
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9


In [28]:
# Simular to range() in python ...

for (i in 1:length(chr_vec)) { # for i in range(0, len(chr_vec))
    print(i)
    print(chr_vec[i])
}

[1] 1
[1] "this"
[1] 2
[1] "is"
[1] 3
[1] "how"
[1] 4
[1] "a"
[1] 5
[1] "for"
[1] 6
[1] "loop"
[1] 7
[1] "works"


### Functions

In [29]:
# Functions also make use of different brackets

class_fits <- function(num_teachers = 1, num_students, room_size) {
    if ((num_teachers + num_students) < room_size) {
        print("Hooray! The room is big enough")
    } else {
        print("Oh no, the room is too small") }
    }

# Can we write this funtion in Python?

# We can supply a value for each arguement
class_fits(2, 13, 18)

# Becuase we supplied a default for teachers ...
class_fits(num_students = 18, room_size = 19)

# Like python, punctions still have a local name scope.

[1] "Hooray! The room is big enough"
[1] "Oh no, the room is too small"


### Pipe Operator

In [3]:
library(tidyverse)

# You might have noticed that you code in R often contains a lot of parenthesis. 
# When you have complex code, this will often mean that you will have to nest
# those parentheses together. This makes you R code hard to read and understand.
# The pipe operator is for this exact purpose.

x <- c(0.322, 0.237, 0.342, 0.983, 0.987 , 0.991, 0.129)

# Compute the logarithm of x, compute the exponential function, round the result
round(exp(log(x)), 1)

# with pipe this is:
x %>% log() %>%
    exp() %>%
    round(1)

# NOTE: you don't need to include the brackets (i.e. log()) here,
# but doing so increases the legibility of your code.

###  Recap: Terminology in R

**Vector**
-  Sequence of data elements of the same basic type.
- An vector is either logical , integer , numeric , complex , character or raw and can have any attributes except a dimension attribute. 
- A **vectorised operation** refers to a the ability to do single mathematical operation on a list -- or "vector" -- of numbers in a single step.

**Lists**
- Closest in python is a Dictionary
- Has a key, value structure

**Matrix**
- Arranges data from a vector into a table. 
- Data has to be the same type.

**Data Frame**
- A matrix-like R object in which the columns can be different types (numeric, character, logical etc.).

**Factors**
- Used to represent categorical data. 
- Can be ordered or unordered and are an important class for statistical analysis and for plotting.

**Dot Product**
- The dot product is an algebraic operation that takes two equal-length sequences of numbers, and returns a single number.

![dot_product.jpg](attachment:dot_product.jpg)

## Your Turn!

Open up RStudio and have a go! These files can be found the lectures repo. 

If you are new to R, start with **01-rmarkdown**.

If you have been taking MY472 (Applied Regression) or have worked with R before, go straight to **02-exercises**

If you finish all that, feel free to get started on the assignment for this week.


In [None]:
# Exercise 1A.

# This code creates a vector `vector_of_squares` which contains the squared
# elements of the vector x

# Loading the relevant package to use the `address()` function
library(____)

# Creating an exemplary vector with 10 elements
x <- ____

# Creating an empty vector with correct final length
vector_of_squares <- numeric(length(x))
  
for (i in 1:length(x)) {
  
  # Replacing the relevant element of the empty vector with the associated square
  vector_of_squares[i] <- x[i]*x[____]
  
  # Printing out the memory address of the vector
  print(address(vector_of_squares))
}

vector_of_squares


In [32]:
# Exercise 1A. ANSWERS

library(pryr)

# This code also creates a vector `vector_of_squares` which contains the squared
# elements of the vector x

# Creating an exemplary vector with 10 elements
x <- 1:10

# Creating an empty vector with correct final length
vector_of_squares <- numeric(length(x)) # or with zero length <- c()
  
for (i in 1:length(x)) {
  
  # Replacing the relevant element of the empty vector
  vector_of_squares[i] <- x[i]*x[i]
  
  print(address(vector_of_squares))
}

vector_of_squares

[1] "0x61fe5e8"
[1] "0x61fe5e8"
[1] "0x61fe5e8"
[1] "0x61fe5e8"
[1] "0x61fe5e8"
[1] "0x61fe5e8"
[1] "0x61fe5e8"
[1] "0x61fe5e8"
[1] "0x61fe5e8"
[1] "0x61fe5e8"


In [None]:
# Exercise 1B.

# Creating an empty vector with length zero
vector_of_squares <- numeric(0)

for (i in 1:length(x)) {
  
  # Appending every new element to the vector which implicitly creates a copy
  vector_of_squares <- append(____, x[i]*x[____])
  
  # Printing out the memory address of the vector
  print(address(vector_of_squares))
}

vector_of_squares

In [33]:
# Exercise 1B. ANSWERS

# This code also creates a vector `vector_of_squares` which contains the squared
# elements of the vector x

# Creating an empty vector with length zero
vector_of_squares <- numeric(0)
for (i in 1:length(x)) {
  
  # Appending every new element to the vector which implicitly creates a copy
  vector_of_squares <- append(vector_of_squares, x[i]*x[i])
  
  # Printing out the memory address of the vector
  print(address(vector_of_squares))
}
vector_of_squares

[1] "0x3f7d2600"
[1] "0x6c85da8"
[1] "0x6db3118"
[1] "0x6db3028"
[1] "0x8c73dc8"
[1] "0x8c78f48"
[1] "0x8c78d88"
[1] "0x8c78ae8"
[1] "0x8db1fd0"
[1] "0x8db1d10"


## Functions

In [34]:
# HINT: Function

# User defined function

some_vector <- c(1,2,5)
my_function <- function(x) {
  x + 5
}
# `sapply` applies a function to every element of a vector
sapply(some_vector, my_function)

# .Anonymous functions

vector_of_squares <- c(1,2,5)
# The anonymous function works as well and does not require you to define a
# separate function
sapply(some_vector, function(y) y + 5)

In [None]:
# Exercise 2. Vector functions

vector_length_a <- function(x) {

  #
  # This function computes the vector length with a loop, modifying the
  # vector in the same memory location
  #
  
  # Creating an empty vector with correct final length
  vector_of_squares <- numeric(length(____))
  
  # Filling out the vector
  for (i in 1:length(x)) {
    
    vector_of_squares[i] <- x[i]*x[i]
    
  }
  
  # Obtaining the dot product
  dot_product <- sum(____)
  
  # Obtaining the final vector length
  vector_length <- sqrt(____)
  
  return(vector_length)
  
}


vector_length_b <- function(x) {

  #
  # This function computes the vector length with a loop, however, the
  # code creates copies of the vector implicitly
  #
  
  # Creating an empty vector of length zero
  vector_of_squares <- numeric(0)
  
  # Filling out the vector
  for (i in 1:length(x)) {
    
    ____ <- append(____, x[i]*x[i])
    
  }
  
  # Obtaining the dot product
  dot_product <- sum(____)
  
  # Obtaining the final vector length
  vector_length <- sqrt(____)
  
  return(vector_length)
  
}
  

  
vector_length_c <- function(x) {
  
  #
  # This function uses an apply approach to avoid writing the for loop
  # explicitly
  #
  
  # Using sapply and an anonymous function to compute the vector of squares
  vector_of_squares <- sapply(x, ____)
  
  # Obtaining the dot product
  dot_product <- sum(____)
  
  # Obtaining the final vector length
  vector_length <- sqrt(____)
  
  return(vector_length)

}

vector_length_d <- function(x) {
  
  #
  # This function uses the operator for matrix multiplication in R
  #
  
  dot_product <- x____x
  vector_length <- sqrt(dot_product[1,1])
  
  return(vector_length)

}

vector_length_e <- function(x) {
  
  #
  # This function uses element wise multiplication
  #
  
  dot_product <- sum(x____x)
  vector_length <- sqrt(dot_product)
  
  return(vector_length)


In [35]:
# Exercise 2. Vector functions - Answers


vector_length_a <- function(x) {
  #
  # This function computes the vector length with a loop, modifying the
  # vector in the same memory location
  #
  
  # Creating an empty vector with correct final length
  vector_of_squares <- numeric(length(x))
  
  # Filling out the vector
  for (i in 1:length(x)) {
    
    vector_of_squares[i] <- x[i]*x[i]
    
  }
  
  # Obtaining the dot product
  dot_product <- sum(vector_of_squares)
  
  # Obtaining the final vector length
  vector_length <- sqrt(dot_product)
  
  return(vector_length)
  
}
vector_length_b <- function(x) {
  #
  # This function computes the vector length with a loop, however, the
  # code creates copies of the vector implicitly
  #
  
  # Creating an empty vector of length zero
  vector_of_squares <- numeric(0)
  
  # Filling out the vector
  for (i in 1:length(x)) {
    
    vector_of_squares <- append(vector_of_squares, x[i]*x[i])
    
  }
  
  # Obtaining the dot product
  dot_product <- sum(vector_of_squares)
  
  # Obtaining the final vector length
  vector_length <- sqrt(dot_product)
  
  return(vector_length)
  
}
  
  
vector_length_c <- function(x) {
  
  #
  # This function uses an apply approach to avoid writing the for loop
  # explicitly
  #
  
  # Using sapply and an anonymous function to compute the vector of squares
  vector_of_squares <- sapply(x, function(x) x^2)
  
  # Obtaining the dot product
  dot_product <- sum(vector_of_squares)
  
  # Obtaining the final vector length
  vector_length <- sqrt(dot_product)
  
  return(vector_length)
}
vector_length_d <- function(x) {
  
  #
  # This function uses the operator for matrix multiplication in R
  #
  
  dot_product <- x%*%x
  vector_length <- sqrt(dot_product[1,1])
  
  return(vector_length)
}
vector_length_e <- function(x) {
  
  #
  # This function uses element wise multiplication
  #
  
  dot_product <- sum(x*x)
  vector_length <- sqrt(dot_product)
  
  return(vector_length)
}

In [36]:
# Check whether all functions return the same outcome:

some_example_vector <- c(1,7,24,5)


vector_length_a(some_example_vector)
vector_length_b(some_example_vector)
vector_length_c(some_example_vector)
vector_length_d(some_example_vector)
vector_length_e(some_example_vector)

In [37]:
# Some arbitrary vector with 10000 elements of which we will determine the length
x <- 42:10041

# Number of repetitions when timing the code (time in individual repetitions fluctuates a lot)
n <- 10

In [38]:
# Exercise 3. Time to compute a vector

# Computing the outcome with a loop that modifies the vector in the same memory location:
system.time(for (i in 1:n) vector_length_a(x))

# Computing the outcome with a loop that implictly creates a copy of the vector in each iteration:
system.time(for (i in 1:n) vector_length_b(x))

# Computing the outcome with an `sapply` function:
system.time(for (i in 1:n) vector_length_b(x))

# Computing the outcome with vectorised code and a dot product:
system.time(for (i in 1:n) vector_length_d(x))

   user  system elapsed 
   0.01    0.00    0.01 

   user  system elapsed 
   2.52    0.07    2.63 

   user  system elapsed 
   2.36    0.05    2.41 

   user  system elapsed 
   0.00    0.01    0.02 