# Introduction to R
A programming language designed for statistical computing and data visualization

## Variable assignment (two ways)

In [None]:
a = 3
b <- 5
print(a * b)

## Simple data types

### Boolean: TRUE, FALSE

In [None]:
truth_value = FALSE
print(!truth_value)
print(truth_value | TRUE)
print(truth_value & TRUE)

### Numbers: Integers and floating-point values

In [None]:
x = 2.0
y = 3
z = 3/4
print(x + y + z)

### Strings

In [None]:
txt = "This is CORE-1."
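print(length(txt)) # 1: a string is a character vector with a single element
print(nchar(txt)) # 15: the number of characters in the string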

### Factors

In [None]:
a = factor("Condition 1")
b = factor("Condition 2")
print(a)
print(b)
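print(as.numeric(a)) # Integer code of the level: 1, because this factor has only one level
print(as.numeric(b)) # Also 1, for the same reason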

## Loops, if-statements, functions
The logic is the same as in Python; only the syntax differs slightly.

In [None]:
i = 25
if (i > 3){
    print('yes')
} else {
    print('no')
}

In [None]:
square <- function(x){
    squared <- x * x
    return (squared)
}
square(10)

In [None]:
sequence = c(1, 2, 3, 4, 5)

for (variable in sequence){
    print(variable)
}

## Vectors

### Creating a vector

In [None]:
v1 <- c(10, 0, 0, 7, 6, 6, 2, 5) # Concatenate the numbers and store them in a vector
print(v1)

v1 <- c(v1, 3) # Append 3 to v1

In [None]:
v2 <- 2:10 # Integer sequence
print(v2)

In [None]:
v3 <- seq(2, 3, by=0.1) # More fine-grained sequence
print(v3)

In [None]:
rep(v1, times=3) # Repeat v1 3 times

In [None]:
rep(v1, each=3) # Repeat each element of v1 3 times

In [None]:
v4 <- rnorm(n=10, mean=0, sd=1) # Sample 10 numbers from a normal distribution with mean 0 and sd 1
print(v4)

### Operations on vectors

In [None]:
sort(v1) # Sort in increasing order
sort(v1, decreasing=TRUE)
print(v1) # v1 is unchanged: sort() returns a sorted copy, it does not sort in place

In [None]:
unique(v1) # Remove duplicates

In [None]:
table(v1) # Get counts

In [None]:
sum(v1) # Sum all elements
mean(v1) # Average of all elements
var(v1) # Sample variance of all elements: Average of squared distance between each element and the mean, corrected by n/(n-1)
sd(v1) # Sample standard deviation: Square root of sample variance
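
To make the $n/(n-1)$ correction mentioned in the comments concrete, here is a minimal sketch that recomputes the sample variance and standard deviation by hand. The names `v`, `manual_var`, and `manual_sd` are introduced only for this illustration; `v` simply repeats the values of `v1`.

```r
# Illustrative copy of v1 (including the appended 3)
v <- c(10, 0, 0, 7, 6, 6, 2, 5, 3)
n <- length(v)

# Sum of squared distances from the mean, divided by n - 1 (not n)
manual_var <- sum((v - mean(v))^2) / (n - 1)
manual_sd <- sqrt(manual_var)

# Both comparisons should print TRUE
print(all.equal(manual_var, var(v)))
print(all.equal(manual_sd, sd(v)))
```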

In [None]:
sample(v1, size = 1) # Choose 1 element at random from the vector v1

In [None]:
cor(v1, v2) # Pearson correlation between v1 and v2 (both vectors must have the same length)

### Creating a vector from another vector

In [None]:
v6 <- 1:10
v7 <- exp(v6)
v8 <- sapply(v6, exp)
print(v6)
print(v7)
print(v7 == v8)
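
Many built-in functions such as `exp` are already vectorized, so `sapply` is mostly useful for functions you write yourself. The sketch below applies a made-up anonymous function to each element; the names `inputs`, `via_sapply`, and `via_vectorized` are only illustrative.

```r
inputs <- 1:5

# Apply an anonymous function to each element in turn
via_sapply <- sapply(inputs, function(k) k^2 + 1)

# The same computation written with vectorized arithmetic
via_vectorized <- inputs^2 + 1

print(via_sapply)
print(all(via_sapply == via_vectorized))
```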

### Accessing vector elements

By index

In [None]:
v1[4] # Fourth element
v1[-4] # All but the fourth
v1[2:4] # Elements 2, 3, and 4
v1[c(1, 5)] # Elements 1 and 5

By value

In [None]:
v1[v1 < 8 & v1 >= 3] # Select only elements that satisfy the condition
v1[v1 %in% v3] # Select only elements of v1 that also occur in v3
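
Selecting by value works because the condition itself evaluates to a logical vector, which is then used as the index. A small sketch, assuming a vector `v` with the same values as `v1` (the names `v` and `mask` are only for illustration):

```r
v <- c(10, 0, 0, 7, 6, 6, 2, 5, 3)

# The condition yields one TRUE/FALSE per element
mask <- v < 8 & v >= 3
print(mask)

# which() converts the logical vector into the matching positions
print(which(mask))

# Indexing with the mask keeps the elements where it is TRUE
print(v[mask])
```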

## Basic plotting: Histograms and line plots

In [None]:
hist(v1)

In [None]:
plot(v6, v7, type="p") # "p": draw points only
plot(v6, v7, type="l") # "l": connect the points with line segments

In [None]:
plot(v6, v7, type="p")
lines(v6, v7) # Use 'lines' to force R to draw on the already-existing plot

### Exercise 1: The exponential function plotted above looks jagged (why?). Can you fix it?

### Exercise 2: Visualizing the Central Limit Theorem

#### The Central Limit Theorem (CLT): Overview
>The **sum (or average)** of a large number of **independent**, **identically distributed random variables** with finite variance is **approximately normally distributed**, regardless of the original distribution of the variables.*  

*For more intuition about the CLT, and for excellent math tutorials in general, I highly recommend this [video and channel](https://www.youtube.com/watch?v=zeJD6dqJ5lo) if you don't know them already.

#### Exercise 2.1.
To get some intuition about the content of the central statement in the CLT, let's start by writing a function that simulates a six-sided fair die (pl. *dice*).

In [None]:
roll_die <- function(){return ()}

#### Exercise 2.2.
Simulate the rolling of a die 100 times, store the values in a vector called `die_roll_outcomes`, and plot its histogram.
* Hint: `replicate(n, instruction)` executes instruction *n* times and places the outputs of the executions in a vector.
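
As a rough illustration of how `replicate` behaves, the sketch below uses it with an unrelated instruction, `runif(1)`, which draws one uniform random number between 0 and 1. The seed value and the name `draws` are arbitrary choices made for this example.

```r
set.seed(1) # Arbitrary seed, only to make the example reproducible

# Evaluate runif(1) five times and collect the results in a vector
draws <- replicate(5, runif(1))

print(draws)
print(length(draws))
```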

#### Exercise 2.3.
The histogram looks fairly uniform because the outcome of a fair die (values $1, \: 2, \dots, \: 6$) is in fact uniformly distributed. What happens when we sum these values? Take the sum of the vector of 100 die rolls. What should the value be, on average?

#### Exercise 2.4.
Following the same logic as above, simulate 1,000 such sums, store them in a vector called `die_sum`, and draw the histogram.

Et voilà! The normal distribution in all its beauty.

#### Exercise 2.5.
Of course, the average of the die rolls will also be approximately normally distributed because the average is just the sum multiplied by a constant, namely $\frac{1}{n}$.

Formally, if the expected value of the random variable X, $E[X]$, equals $\mu$, the expected value of the average of iid RVs is: 

$E\bigg[\frac{X_1 + X_2 + \dots + X_n}{n} \bigg] \overset{\text{by linearity of expectation}}= \frac{1}{n} \cdot E \bigg[X_1 + X_2 + \dots + X_n \bigg] = \frac{1}{n} \cdot \big( \underbrace{E[X_1]}_{= \mu} + E[X_2] + \dots + E[X_n] \big ) = \frac1n \cdot n \mu = \mu$

**What about the variance and standard deviation?**  
Suppose $\text{Var}(X) = \sigma^2$. Two facts about the variance of RVs are relevant for computing the variance of the average of $X_i\text{s}$.

First, the variance of the sum of independent random variables equals the sum of the variances:
$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$

Second, the variance of a random variable scaled by a constant is the variance of the RV multiplied by the constant squared:
$\text{Var}(aX) = a^2\text{Var}(X)$

Therefore: $\text{Var}(X_1 + X_2 + \dots + X_n) = n\cdot\sigma^2$  
And: $\text{Var}\bigg(\frac{X_1 + X_2 + \dots + X_n}{n} \bigg) = \frac{1}{n^2}\text{Var}(X_1 + X_2 + \dots + X_n) = \frac{1}{n^2} \cdot n \cdot \sigma^2 = \frac{\sigma^2}{n}$

Since $\text{SD}(X) = \sqrt{\text{Var}(X)}, \: \text{SD}\Big(\frac{\sum_{i=1}^n X_i}{n} \Big) = \frac{\sigma}{\sqrt{n}}$
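
The two variance facts used in the derivation above can be checked numerically. The following is only a sketch: the normal distributions, their parameters, the constant 5, the seed, and the names `x`, `y`, and `n_draws` are all assumptions chosen for illustration, not part of the exercises.

```r
set.seed(42) # Arbitrary seed for reproducibility
n_draws <- 100000

# Two independent samples with known variances 4 and 9
x <- rnorm(n_draws, mean = 0, sd = 2)
y <- rnorm(n_draws, mean = 1, sd = 3)

# Fact 1: Var(X + Y) should be close to Var(X) + Var(Y) = 13
print(var(x + y))

# Fact 2: Var(aX) = a^2 * Var(X); with a = 5 this should be close to 100
print(var(5 * x))
print(all.equal(var(5 * x), 25 * var(x)))
```

The first check only holds approximately because the sample covariance of `x` and `y` is close to, but not exactly, zero; the scaling fact holds for the sample variance up to floating-point error.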

**What is the variance of a fair six-sided die?**  
You might be tempted to think that it just equals `var(1:6)`, but that would not be accurate. Why?

In [None]:
var(1:6)

#### Exercise 2.6.
Write two functions: one that computes the **population** variance of a random variable (`pop_var`),
$\text{Var}[X] = E\Big[\big(X - E[X] \big)^2 \Big]$, and one that computes the population standard deviation (`pop_sd`).

In [None]:
die_var <- pop_var(1:6)
die_sd <- pop_sd(1:6)

die_var
die_sd

#### Exercise 2.7.
We have already seen that the average of a large number of RVs is approximately normally distributed with mean $\mu$. Let's also check that the standard deviation of this normal distribution equals $\frac{\sigma}{\sqrt{n}}$. Earlier, we simulated $n=100$ die rolls at a time and summed them (the average is just this sum divided by $n$). We were able to visualize the resulting normal distribution because we repeated this process many times (1,000). We're now interested in the standard deviation of the average. Write a function called `sd_mean_die_rolls` that takes $n$ as its argument and does the following:
1. Generates $n$ rolls of a fair six-sided die and computes their average, repeating this 1,000 times.
2. Returns the standard deviation of this **sample** of 1,000 averages.

#### Exercise 2.8.
Create an integer-valued vector with values from 1 to 100 (inclusive) in increments of 2 and sweep your `sd_mean_die_rolls` function through it. 
- Hint: Use `sapply`

#### Exercise 2.9.
Plot the sample standard deviations as a function of $n$.

#### Exercise 2.10.
Which function of $n$ does the plot above approximate? Draw the plot in **Exercise 2.9** again, then plot this function on top of it using `lines(x, y)`.

#### Exercise 2.11 (Optional)
Human height is often used as an example of a variable that is normally distributed. Can you think of an explanation for why this should be so?