In [None]:
options(jupyter.rich_display = FALSE)

# Creating, subsetting and manipulating vectors
**Written and Partially Adapted by Serhat Çevikel**

Emre Can and Ceren are good friends. Sometimes they race on bridges:

[![Race on the Bridge](https://refractionsfilm.files.wordpress.com/2013/01/jules-and-jim-1.jpg)](https://www.youtube.com/embed/_i3NxEhqcFg?start=29&end=37)


## Understanding R

> “To understand computations in R, two slogans are helpful:
> 
> Everything that exists is an object.
>
> Everything that happens is a function call.”
>
> — John Chambers

(http://adv-r.had.co.nz/Functions.html)

<img src="https://statweb.stanford.edu/~jmc4/CopyPhoto.jpg" width="200">

## Creating a vector, getting info on objects

Create a vector named "worker_age" with Emre=25, Can=52, Ceren=33

In [None]:
worker_age <- c(Emre = 25, Can = 52, Ceren = 33)

Get the item names and values:

In [None]:
worker_age

And class of the vector object:

In [None]:
class(worker_age)

So we have a vector with numeric values

Now let's get info on the attributes of "worker_age" object, such as names:

In [None]:
attributes(worker_age)

To get more info on the object (class, summary data, attributes):

In [None]:
str(worker_age)

What is the class of "class"?

In [None]:
class(class)

In R, functions are objects too, just like vectors

And, what is the class of an object with a single value?

In [None]:
worker_age <- 25
class(worker_age)

It is still a numeric vector. In R there is no separate object type for "scalar" or single values: a single value is just a vector of length one. The most basic ("atomic") object is the vector
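
As a small illustration of the point above, the following sketch checks that a single number behaves like a vector of length one. The object name `single_value` is made up for this example.

```r
# A single number is simply a numeric vector of length one
single_value <- 25

is.vector(single_value)            # TRUE
length(single_value)               # 1
identical(single_value, c(25))     # TRUE: c(25) builds the very same object
single_value[1]                    # it can be subsetted like any other vector
```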

Get info on class, attributes and `str` function:

In [None]:
?class

In [None]:
?attributes

In [None]:
?str

### Splitting across lines

Create the same vector, putting each value in a separate line

Get the values and class

In [None]:
worker_age <- c(Emre = 25,
                Can = 52,
                Ceren = 33)

In [None]:
worker_age

In [None]:
class(worker_age)

Nothing changes. For better readability you can break long lines after the commas

## Naming and unnaming a vector

Create the same vector, first assign the values and then assign the names separately

In [None]:
worker_age <- c(25, 52, 33)
worker_age

In [None]:
names(worker_age)

In [None]:
names(worker_age) <- c("Emre", "Can", "Ceren")
worker_age

Print the vector [with or without](https://www.youtube.com/watch?v=6DeDzsCGbsQ) the names:

In [None]:
worker_age

In [None]:
unname(worker_age)

Get info on names and unname functions:

In [None]:
?names

In [None]:
?unname

## c() function

The `c()` function combines R vector objects into a single vector.

Note that all items in an R vector are of the same type (or they are coerced to be so, if they are not)
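
A minimal sketch of this coercion, using two made-up vectors `mixed_1` and `mixed_2`: when types are mixed inside `c()`, R converts everything to the most general type present.

```r
# Logical values mixed with numbers are converted to numbers (TRUE -> 1, FALSE -> 0)
mixed_1 <- c(1, TRUE, FALSE)
mixed_1
class(mixed_1)    # "numeric"

# As soon as one character value is present, everything becomes character
mixed_2 <- c(1, "a", TRUE)
mixed_2
class(mixed_2)    # "character"
```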

Get info on `c()` function

In [None]:
?c

Get the contents (code) of the c function

In [None]:
c

Create an object named "c" and assign an arbitrary value to it

In [None]:
c <- 1

Now try to get the contents (code) of the c function

In [None]:
c

Now create a vector of numeric values 2 and 3 by using function c, w/o assignment

In [None]:
c(2, 3)

See, when you create an object whose name coincides with a built-in function, the function still works when it is called, but typing its name no longer shows the function's code. This creates confusion and inconvenience. For good R style, don't use built-in function names for your own objects
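
If a built-in name has already been masked, one way out is sketched below. It assumes the masking object lives in the current workspace; the name `combined` is introduced only for this example.

```r
c <- 1                  # an object named c now masks the function's name

combined <- c(2, 3)     # a call still finds the function, not our object
combined

base::c                 # the built-in is always reachable through its namespace

rm(c)                   # remove our object ...
is.function(c)          # ... and the name c refers to the function again: TRUE
```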

**EXERCISE 1**

Create a vector of items named by the first names of the members of your favorite band

And values will be their corresponding instruments

The name of the vector will be the name of the band (with no spaces)

**SOLUTION 1:**

In [None]:
led_zeppelin <- c("Jimmy" = "Guitar",
                    "Robert" = "Vocals",
                    "John Paul" = "Bass",
                    "Bonzo" = "Drums")
led_zeppelin

**EXERCISE 2:**

Create the band vector with only instruments (no names)

Then set the names to the members' full names.

**SOLUTION 2:**

In [None]:
led_zeppelin <- c("Guitar", "Vocals", "Bass", "Drums")
names(led_zeppelin) <- c("Jimmy Page", "Robert Plant", "John Paul Jones", "John Bonham")
led_zeppelin

## Subsetting a vector

Create a vector **worker_income** with Emre=36000, Can=38500, Ceren=40700 <p></p> Get the values and class

In [None]:
worker_income <- c(Emre = 36000, Can = 38500, Ceren = 40700)

In [None]:
worker_income

In [None]:
class(worker_income)

Create a vector of values 2 and 3 by `c()` function, w/o assignment

In [None]:
c(2, 3)

Create a vector of values from 2 to 3 by using the colon operator `:`, w/o assignment

In [None]:
2:3

Get information on the colon operator

In [None]:
?":"

Create a vector of FALSE, TRUE and TRUE values, w/o assignment

In [None]:
c(F, T, T)

### Subset with a vector of indices

Let's recreate the objects, in case they have been changed above:

In [None]:
worker_age <- c(Emre = 25, Can = 52, Ceren = 33)
worker_income <- c(Emre = 36000, Can = 38500, Ceren = 40700)

worker_age

Get the ages of Can and Ceren by combining relevant indices using `c()` function

In [None]:
worker_age[2]
worker_age[c(2,3)]

We subsetted the second and third items of the **worker_age** object.

We used a vector of those values with c(2,3).

However this vector is not assigned to a name.

To use it again, we have to create it again

Get information on subsetting or extracting operator `[`

In [None]:
?"["

Now assign the vector created by c(2,3) to **indices_1** object

In [None]:
indices_1 <- c(2, 3)

And view the contents of it:

In [None]:
indices_1
class(indices_1)

Get the same values by using **indices_1** vector

In [None]:
worker_age[indices_1]

In [None]:
worker_age[c(3,1)]

Get the ages of Can and Ceren by combining relevant indices using `:` colon operator

In [None]:
worker_age[2:3]

When the indices are consecutive, you can use the colon operator to create the vector of indices. Equally spaced indices can be built from it with arithmetic (such as `1:3 * 2`) or with `seq()`
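
A short sketch of consecutive and equally spaced indices. The vector `x` and the result names are made up for illustration.

```r
x <- c(a = 10, b = 20, c = 30, d = 40, e = 50, f = 60)

consecutive <- x[2:4]                  # items 2, 3, 4
every_second <- x[1:3 * 2]             # `:` is evaluated before `*`, so the indices are 2, 4, 6
odd_positions <- x[seq(1, 5, by = 2)]  # indices 1, 3, 5

consecutive
every_second
odd_positions
```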

Now assign 2:3 to **indices_2** object

In [None]:
indices_2 <- 2:3

Get the ages using **indices_2** object

In [None]:
worker_age[indices_2]

### Subset with a vector of boolean values

Get the ages of Can and Ceren by using a boolean vector that says which items to include **T** or exclude **F**

In [None]:
worker_age[c(F,T,T)]

Now assign the boolean vector to object **bool_1**

In [None]:
bool_1 <- c(F, T, T)

Get the ages using this boolean vector

In [None]:
worker_age[bool_1]

Now subset the **worker_age** vector with just one **TRUE** value. Does it return only the first item?

In [None]:
worker_age[T]

Why did it return the whole vector although we supplied a single logical value?

R automatically recycled the vector with a single **T** value to the length of the **worker_age** vector.
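
Recycling also applies to logical vectors longer than one. The sketch below uses a two-item logical vector on the three-item **worker_age** vector; the name `recycled` is made up for this example.

```r
worker_age <- c(Emre = 25, Can = 52, Ceren = 33)

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE to match the length of worker_age,
# so the first and third items are kept
recycled <- worker_age[c(TRUE, FALSE)]
recycled
```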

Now subset it with all **T** logical values

In [None]:
worker_age[c(T,T,T)]

Now do the same thing but intentionally confuse `[` with `(`

In [None]:
worker_age(c(T, T, T))

Why the error? Because `(` after a name means a function call, **and there is no function named worker_age**

### Subset with a vector of names

Now get the ages by using Can's and Ceren's names, w/o combining them with `c()`

In [None]:
worker_age["Can", "Ceren"]

See, when the indices are not combined into an object (whether assigned or not), they are treated as separate arguments. And since a vector has only one dimension, the second argument gives an error

Now get the ages using names, combining them with `c()`

In [None]:
worker_age[c("Ceren", "Can")]

Now assign Can and Ceren names to a vector called **names_1**

In [None]:
names_1 <- c("Can", "Ceren")

And get the ages using **names_1** vector

In [None]:
worker_age[names_1]

### Subset summary

See, you can subset or extract items from a vector in several ways: with a vector of indices, a vector of boolean/logical values, or a vector of names (if the vector items are named)
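
As a quick check, the three approaches can be compared side by side. The names `by_index`, `by_bool` and `by_name` are introduced only for this sketch.

```r
worker_age <- c(Emre = 25, Can = 52, Ceren = 33)

by_index <- worker_age[c(2, 3)]               # positions
by_bool  <- worker_age[c(FALSE, TRUE, TRUE)]  # logical mask
by_name  <- worker_age[c("Can", "Ceren")]     # names

# All three return the same named vector
identical(by_index, by_bool)
identical(by_index, by_name)
```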

Now get the classes of all vectors that we used for subsetting **worker_age**

In [None]:
class(indices_1)

In [None]:
class(indices_2)

In [None]:
class(bool_1)

In [None]:
class(names_1)

See, c(2,3) creates a numeric object while 2:3 creates an integer object
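
The difference can be made explicit with the `L` suffix, which marks a literal as an integer. A small sketch:

```r
class(c(2, 3))              # "numeric": plain number literals are doubles
class(2:3)                  # "integer"
class(c(2L, 3L))            # "integer": the L suffix creates integer literals

identical(2:3, c(2L, 3L))   # TRUE: same type and same values
identical(2:3, c(2, 3))     # FALSE: the values are equal but the types differ
2:3 == c(2, 3)              # TRUE TRUE: elementwise comparison ignores the type difference
```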

Now list the names of the items in **worker_age**

In [None]:
names(worker_age)

Now get all the attributes of **worker_age**

In [None]:
attributes(worker_age)

See, our **worker_age** vector's only attribute is names of items

**EXERCISE 3:**

Create an object named **seri_koz_getir** of the series of numbers 1 to 26

Name the values with the letters of the English alphabet (the built-in `letters` object will do that)

 1. First subset even values using indices (what is an easy hack to create such a sequence using `:` operator?)
 2. Then subset the elements named by the letters of "kardesim"
 3. Last, subset even numbers by boolean values. Note that a shorter boolean vector will be recycled to the length of the subsetted object

**SOLUTION 3:**

In [None]:
seri_koz_getir <- 1:26
names(seri_koz_getir) <- letters

print(seri_koz_getir[1:13 * 2])
print(seri_koz_getir[c('k', 'a', 'r', 'd', 'e', 's', 'i', 'm')])
print(seri_koz_getir[c(F, T)])

Note that `print()` is called explicitly for each statement, so every result is displayed even in front ends that show only the value of the last statement

In [None]:
?print

## Negative subsetting and object modification

In [None]:
worker_age <- c(Emre = 25, Can = 52, Ceren = 33)
worker_income <- c(Emre = 36000, Can = 38500, Ceren = 40700)

Get info on the `length()` function:

In [None]:
?length

Get the length of **worker_income** object

In [None]:
length(worker_income)

Get the names again:

In [None]:
names(worker_income)

See "Can" is the name of second item

Show the **worker_income** object w/o Can's income

In [None]:
worker_income[-2]

Check whether the **worker_income** object still has Can's income

In [None]:
worker_income

See, unless we assign the result back to the object, it is not updated. Subsetting returns a new vector; without assignment, R objects are "immutable", unchangeable

Now assign the **worker_income** vector without Can's income to **worker_income_2**

In [None]:
worker_income_2 <- worker_income[-2]

Check **worker_income_2**

In [None]:
worker_income_2

Now create a copy of the **worker_income** object as **worker_income_3**

In [None]:
worker_income_3 <- worker_income

In [None]:
worker_income_3

Now delete Can's income in **worker_income_3** object by excluding the item and assigning it back to the same vector.

Check the vector

In [None]:
worker_income_3 <- worker_income_3[-2]

In [None]:
worker_income_3

Now repeat the same assignment and check the vector. Each repetition removes whichever item is currently in second position

In [None]:
worker_income_3 <- worker_income_3[-2]

In [None]:
worker_income_3

Update Emre's income to 39000 (using "Emre"'s name) and check the vector:

In [None]:
worker_income_3["Emre"] <- 39000
worker_income_3

Insert Selin's age, which is 27, at the second position of the **worker_age** vector.

Check the vector:

In [None]:
worker_age <- c(worker_age[1], Selin = 27, worker_age[2:3])
worker_age

Create a copy of **worker_age** into **worker_age2**, append a new person with age 50 and name "Naim" by assigning to the 5th index (which does not exist yet!).

Check the vector:

In [None]:
worker_age2 <- worker_age
worker_age2[5] <- c(Naim = 50)
worker_age2

See, you can append an item to the end of a vector by assigning to the next index, as if that item already existed and you were updating its value. However, the name is not set in this manner

To set the name to "Naim":

In [None]:
names(worker_age2)[5] <- "Naim"
worker_age2
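
Base R also has an `append()` function whose `after` argument states the position after which the new values go. The sketch below redoes the two insertions with it. The object names `with_selin` and `with_naim` are made up for this example, and the names of the new items are carried along.

```r
worker_age <- c(Emre = 25, Can = 52, Ceren = 33)

# Insert Selin after the first item
with_selin <- append(worker_age, c(Selin = 27), after = 1)
with_selin

# Without `after`, the new item goes to the end
with_naim <- append(with_selin, c(Naim = 50))
with_naim
```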

Create a vector of boolean/logical values for whether age values are smaller than 40

In [None]:
worker_age < 40

Get workers which are younger than 40

In [None]:
worker_age[worker_age < 40]

Now assign the boolean values for `worker_age < 40` to object **bool_2**, and check the vector:

In [None]:
bool_2 <- worker_age < 40
bool_2

Now get the workers younger than 40 using **bool_2**

In [None]:
worker_age[bool_2]

## Sorting

Sort the workers' incomes in descending order

In [None]:
sort(worker_income, decreasing = TRUE)

See whether **worker_income** vector is sorted by that action

In [None]:
worker_income

Unless you assign the result back to the object, the object stays as it was: functions like `sort()` return a new vector and do not alter their input

Sort the workers' ages in ascending order

In [None]:
sort(worker_age)

Get information on sort function

In [None]:
?sort

## Subset and modify

In [None]:
worker_age <- c(Emre = 25, Can = 52, Ceren = 33)
worker_income <- c(Emre = 36000, Can = 38500, Ceren = 40700)

Now let's do something more demanding: <p></p>Increase by 10% the income of workers who have an income of less than 40000. Check the vector

In [None]:
worker_income
worker_income[worker_income < 40000] <- worker_income[worker_income < 40000] * 1.1
worker_income

See the `* 1.1` operation is executed on all selected values (income less than 40K) at once. We do not have to loop over each value. This is called "vectorization" and the power of R comes from vectorization
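
For comparison, here is a sketch of the same update written once with an explicit loop and once in vectorized form. The names `looped`, `vectorized` and `low` are introduced only for this example.

```r
worker_income <- c(Emre = 36000, Can = 38500, Ceren = 40700)

# Loop version: visit each item and update it if it is below 40000
looped <- worker_income
for (i in seq_along(looped)) {
  if (looped[i] < 40000) {
    looped[i] <- looped[i] * 1.1
  }
}

# Vectorized version: build the logical mask once, update all selected items at once
vectorized <- worker_income
low <- vectorized < 40000
vectorized[low] <- vectorized[low] * 1.1

all.equal(looped, vectorized)   # both versions give the same result
```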

## Ordering

In [None]:
worker_age <- c(Emre = 25, Can = 52, Ceren = 33)
worker_income <- c(Emre = 36000, Can = 38500, Ceren = 40700)

Recall the **worker_age** vector and get the order of the workers' ages

In [None]:
order(worker_age)

What does this output mean? Get info on order function

In [None]:
?order

Now recall the age and income vectors, get the order of ages again, and sort the workers' incomes by their ages

In [None]:
worker_age
worker_income
order(worker_age)
worker_income[order(worker_age)]

Now do the sort in descending order of ages, either with `decreasing = T` or by negating the ages

In [None]:
worker_income[order(worker_age, decreasing = T)]

In [None]:
worker_income[order(-worker_age)]
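
To make the meaning of the `order()` output concrete, this sketch shows that subsetting a vector with its own order gives the sorted vector. The name `ord` is made up for this example.

```r
worker_age <- c(Emre = 25, Can = 52, Ceren = 33)

ord <- order(worker_age)
ord                  # 1 3 2: take item 1 first, then item 3, then item 2

worker_age[ord]      # the items picked in that order are sorted

identical(worker_age[ord], sort(worker_age))   # TRUE
```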

**EXERCISE 4**

![Daltons](http://3.bp.blogspot.com/-ddgg1O7JjIo/UO59sy6GX_I/AAAAAAAAEyY/Y3Loj9ClpaI/s1600/Lucky-Luke-3-0QHGZIDENC-1024x768.jpg)

1. Create a vector **heights** with values 160 to 190 named for the guys above (use colon and multiplication)
2. Create a vector **weights** with values 45 to 60 and same names
3. Order the **heights** with the ascending order of **weights**
4. Order the **weights** vector with the descending order of **heights**

**SOLUTION 4:**

In [None]:
dalton_names <- c("Joe Dalton", "William Dalton", "Jack Dalton", "Averell Dalton")
heights <- 16:19 * 10
weights <- 9:12 * 5

names(heights) <- dalton_names
names(weights) <- dalton_names

heights
weights

order(heights, decreasing = T)
order(weights)

print(heights[order(weights)])
print(weights[order(heights, decreasing = T)])

## Random sampling, order() vs. rank()

![resim.png](attachment:resim.png)
by https://towardsdatascience.com/r-rank-vs-order-753cc7665951

Create a vector named **sample_1** which has randomly selected 10 numbers between 1 and 20 (without repetition). Check the vector

In [None]:
set.seed(1000) # this is here in order to reproduce the same "random" sample
sample_1 <- sample(1:20, 10, replace = F)
sample_1

Now get the order of the items and then the rank of the items. Are they different? Why?

In [None]:
order(sample_1)

In [None]:
rank(sample_1)

Get information on `rank()` function

In [None]:
?rank

`order()` gives the indices of the items, so that if we pick the items in that order they will be sorted

`rank()` gives, for each item, the position it would take if the vector were sorted
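
A tiny hand-checkable sketch of the difference, using a made-up three-item vector `x`:

```r
x <- c(30, 10, 20)

order(x)         # 2 3 1: the smallest value is at position 2, then position 3, then position 1
rank(x)          # 3 1 2: 30 is the 3rd smallest, 10 the 1st, 20 the 2nd

x[order(x)]      # 10 20 30: picking items in that order sorts the vector

# When there are no ties, applying order() twice gives the ranks
order(order(x))  # 3 1 2
```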

## Value types and coercion

Create a vector of integer values 1 to 10 and assign to **vector_1**. Get the values and class

In [None]:
vector_1 <- 1:10
vector_1
class(vector_1)

Now update 2nd item to string "cmpe". Get the vector again and its class

In [None]:
vector_1[2] <- "cmpe"
vector_1
class(vector_1)

A vector is an object that holds values of the same type. And its type is dynamic: as the vector is updated with values of other types, the type/class of the whole vector changes. When only one value is changed to a character, the whole vector is "coerced" into a character vector

Now create a vector of lower case letters a to e and assign to **vector_2**. Check the vector and its class:

In [None]:
vector_2 <- letters[1:5]
vector_2
class(vector_2)

Now assign number 1 to 2nd item. Check the vector again and its class

In [None]:
vector_2[2] <- 1
vector_2
class(vector_2)

See, any number can be represented as a character string, but an arbitrary character string cannot be represented as a number. So the class of the vector stays as character; it is not coerced into numeric after the assignment
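
The direction of coercion can be sketched with explicit conversions. For the types used here the order is logical, then integer, then numeric (double), then character. The name `not_a_number` is made up for this example.

```r
as.character(1)     # "1": a number can always be written as a string
as.numeric("1")     # 1: works only because the string looks like a number

# A string that is not a number becomes NA (R also raises a warning, silenced here)
not_a_number <- suppressWarnings(as.numeric("cmpe"))
not_a_number

# Mixing types in c() moves everything to the more general type
class(c(TRUE, 2L))    # "integer"
class(c(2L, 2.5))     # "numeric"
class(c(2.5, "a"))    # "character"
```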

## Indexing with a function

Now let's do some critical thinking!

First create a vector of values selected from 1 to 20 and with a random length, assign it to object **sample_2**

And then

1. get the last item of that vector

2. get all but the last item of that vector

Note that you don't know the index of the last item beforehand!

In [None]:
sample_2 <- sample(1:20, sample(1:20, 1))
sample_2

In [None]:
sample_2[length(sample_2)]

In [None]:
sample_2[-length(sample_2)]
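
The same two results can also be obtained with `tail()` and `head()`, which accept a count of items (a negative count means "all but that many"). A sketch, with a seed added here only to make the example reproducible:

```r
set.seed(1)   # assumption: any seed will do, it only fixes the random draw
sample_2 <- sample(1:20, sample(1:20, 1))

last_item <- tail(sample_2, 1)       # the last item
all_but_last <- head(sample_2, -1)   # everything except the last item

identical(last_item, sample_2[length(sample_2)])
identical(all_but_last, sample_2[-length(sample_2)])
```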

Get info on `sample()` function

In [None]:
?sample

## Vectorized Functions

### Sums and cumulative sums with sum() and cumsum()

Create a vector of values as a sequence from 1 to 10 and save into **seq10** vector

In [None]:
seq10 <- 1:10
seq10

Now get the sum of the values in the vector:

In [None]:
sum(seq10)

See, the sum of a vector is itself a vector with a single value

What if we want a cumulative sum of the values?

For each position n, the sum of all terms from the first to the nth:

In [None]:
seq10c <- cumsum(seq10)
seq10c

What if we want to get the sums starting from the last item?

We can use `rev()` function for that:

In [None]:
seq10r <- rev(seq10)
seq10r
cumsum(seq10r)

**EXERCISE 5:**

Starting from the cumulative sums vector **seq10c** how can we recreate the original vector easily?

Note that arithmetic operations on multiple vectors are conducted elementwise, or in a "vectorized fashion"

First try and understand this:

In [None]:
1:4 + 2:5

**SOLUTION 5:**

In [None]:
seq10c - c(0, seq10c)[-11]
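
As an alternative sketch, base R's `diff()` computes the differences between consecutive items, so prepending a zero recovers the original values. The name `recovered` is made up for this example.

```r
seq10 <- 1:10
seq10c <- cumsum(seq10)

# diff() returns x[i + 1] - x[i] for each consecutive pair
recovered <- diff(c(0, seq10c))
recovered

all(recovered == seq10)   # TRUE
```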

### Products and cumulative products with prod() and cumprod()

Now let's multiply all items in a vector to yield a single value:

In [None]:
seq10 <- 1:10
seq10

prod(seq10)

Or cumulative product of all terms from the first to the nth term:

In [None]:
seq10cp <- cumprod(seq10)
seq10cp

**EXERCISE 6:**

Get the product of all terms from the 5th to the 10th using only **seq10cp**.

To check your solution the value has to be:

In [None]:
prod(5:10)

**SOLUTION 6:**

In [None]:
seq10cp[10] / seq10cp[4]

## NA vs NULL values

Create a vector of values from 1 to 5, check the class

In [None]:
vec5 <- 1:5
vec5
class(vec5)

Assign "NA" inside quotes to 5th element and check the class of the vector

In [None]:
vec5[5] <- "NA"
vec5
class(vec5)

"NA" inside quotes caused all values to be coerced to character since "NA" is a character

Now recreate vec5 and assign NA w/o quotes:

In [None]:
vec5 <- 1:5
vec5[5] <- NA
class(vec5)

The class is preserved. Now do the same for letters from a to e and assign NA to last item

In [None]:
vecae <- letters[1:5]
vecae
class(vecae)
vecae[5] <- NA
class(vecae)

So NA is class agnostic: It works on all types and preserves the class.

But what is the class of NA itself?

In [None]:
class(NA)

It is a logical value: not **T**, not **F**, just "don't know"

Now assign **NULL** to fifth item of **vec5** and see what happens:

In [None]:
vec5[5] <- NULL

R does not allow us to replace a value with NOTHING

Let's have two objects: vecna which has a single NA and vecnull which has a single NULL
Check the length and class of each object

In [None]:
vecna <- NA
class(vecna)
length(vecna)

In [None]:
vecnull <- NULL
class(vecnull)
length(vecnull)

NA is something that exists, but whose details (class and value) are unknown

NULL is something that does not exist! It is NOTHING as Michael Corleone says:

[![My offer is this: Nothing](https://img.youtube.com/vi/KnmIoF_2Q4Y/0.jpg)](https://www.youtube.com/watch?v=KnmIoF_2Q4Y)
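
The practical difference shows up when the two are combined with other values. A sketch with two made-up vectors `with_na` and `with_null`:

```r
with_na   <- c(1, NA, 3)     # NA occupies a position
with_null <- c(1, NULL, 3)   # NULL simply disappears

length(with_na)     # 3
length(with_null)   # 2

is.na(with_na)      # FALSE TRUE FALSE
is.null(NULL)       # TRUE

sum(with_na)                  # NA: an unknown value makes the sum unknown
sum(with_na, na.rm = TRUE)    # 4: NA values are dropped before summing
```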

## Create sequences

`seq()` creates sequences with more control than the `:` operator:

```R
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)
```

- Create a sequence from 3 to 21 by steps of 3.5:

In [None]:
seq(from = 3, to = 21, by = 3.5)

- Create a sequence from 5.2, by steps of 1.6 and with a length of 14:

In [None]:
seq(from = 5.2, by = 1.6, length.out = 14)

- Create a sequence that ends in 34, by steps of 3.1 with a length of 20

In [None]:
seq(to = 34, by = 3.1, length.out = 20)
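
A few more sketches of `seq()` and its base R relatives `seq_len()` and `seq_along()`. The object name `quarters` is made up for this example.

```r
# Fix both ends and the number of items; the step is computed for you
quarters <- seq(from = 0, to = 1, length.out = 5)
quarters                        # 0.00 0.25 0.50 0.75 1.00

seq_len(4)                      # 1 2 3 4
seq_along(c("a", "b", "c"))     # 1 2 3: one index per item

# seq_len() handles the empty case safely, unlike the colon operator
length(seq_len(0))              # 0
1:0                             # 1 0: counts downwards instead of being empty
```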

## %in% operator

%in% is a binary operator, which returns a logical vector indicating if there is a match or not for its left operand.

- Create two sequences 
    - from 1 to 7
    - from 14 to 4

- Check whether each member of the first sequence is within the second one and vice versa
- Subset the vectors with those boolean values

In [None]:
s1 <- 1:7
s2 <- 14:4

s1 %in% s2
s1[s1 %in% s2]

s2 %in% s1
s2[s2 %in% s1]
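
Two related base R functions are sketched below: `match()` returns the positions of the matches instead of TRUE/FALSE, and `intersect()` returns the common values directly. The result names are made up for this example.

```r
s1 <- 1:7
s2 <- 14:4

positions <- match(s1, s2)   # where each item of s1 sits in s2, NA if absent
positions                    # NA NA NA 11 10 9 8

common <- intersect(s1, s2)  # the values found in both vectors
common                       # 4 5 6 7

identical(common, s1[s1 %in% s2])   # TRUE
```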

## Repeat values

What if we want to create 20 TRUE values, or 2e6 of them?

`rep()` function comes to help!

In [None]:
rep(1 == 1, 10)

In [None]:
rep(1:2, 5)

In [None]:
rep(1:2, each = 5)

A similar function, which repeatedly evaluates the expression you give it, is `replicate()`

In [None]:
replicate(10, 1 == 1)

In [None]:
replicate(5, seq(1,2))

Apart from the shape of the output, the difference between `rep()` and `replicate()` is that the former evaluates its argument once and repeats the resulting value, while the latter evaluates the expression again for each repetition.

The equivalent operations for the statements `rep(1 == 1, 5)` and `replicate(5, 1 == 1)` would be:

```r
rep(1 == 1, 5)        ~  c(T,      T,      T,      T,      T)
replicate(5, 1 == 1)  ~  c(1 == 1, 1 == 1, 1 == 1, 1 == 1, 1 == 1)
```
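
The difference becomes visible when the repeated expression is random. A sketch using `runif()`, with a seed added here only for reproducibility; the names `repeated` and `replicated` are made up for this example.

```r
set.seed(42)   # assumption: any seed will do

repeated   <- rep(runif(1), 3)         # one random draw, copied three times
replicated <- replicate(3, runif(1))   # three separate random draws

repeated
replicated

length(unique(repeated))     # 1: all three values are the same draw
```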

## all or any

Suppose we want to test the joint truth of multiple logical values:

We may want to know whether ALL of them are **TRUE**:

In [None]:
logi1 <- c(rep(T, 100), F)
logi1
all(logi1)

The presence of even one FALSE, rendered the result of `all()` function as FALSE

Now let's get whether ANY of them are true:

In [None]:
logi2 <- c(rep(F, 100), T)
logi2
any(logi2)

The presence of even one TRUE, rendered the result of `any()` function as TRUE
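
In practice `all()` and `any()` are usually applied to the result of a comparison. A sketch on the **worker_age** vector, including how unknown values and empty vectors behave:

```r
worker_age <- c(Emre = 25, Can = 52, Ceren = 33)

all(worker_age > 18)    # TRUE: every worker is older than 18
any(worker_age > 50)    # TRUE: at least one worker is older than 50
all(worker_age < 40)    # FALSE: Can is not younger than 40

# With NA the answer may be unknown, unless NA values are removed
any(c(FALSE, NA))                  # NA
any(c(FALSE, NA), na.rm = TRUE)    # FALSE

# On an empty vector all() is TRUE and any() is FALSE
all(logical(0))
any(logical(0))
```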