# Section 1: R Basics, Functions and Data Types

This notebook explores the basics of R.

---
## R Basics

The first objective of this notebook is to introduce the concept of an object.

Suppose a high school student asks us for help solving a few quadratic equations of the form $ax^2 + bx + c = 0$. We know that its solutions are given by $x = \frac{-b \pm \sqrt{b^2 - 4ac}} {2a}$. One advantage of using a programming language is that we can reuse the same expression while changing the values of the variables.

For example, if we wanted to solve $x^2 + x - 1 = 0$, we could assign values to *a, b* and *c*. In R, the assignment symbol is __<-__:

In [1]:
a <- 1
b <- 1
c <- -1

You can see the values stored in each variable by just typing it and executing the line:

In [2]:
a

A more explicit way of asking R the value of a variable is by using the __print__ function:

In [3]:
print(a)

[1] 1


The term object is used to describe stuff that is stored in R. Variables are examples, but objects can also be more complicated entities such as functions.

When we define objects in the console, we are actually changing what is called the workspace. We can see all the objects saved in the workspace by using the function __ls__:

In [4]:
ls()

If we try to recover the value of a variable that is not in the workspace, we'll receive an error:

In [5]:
x

ERROR: Error in eval(expr, envir, enclos): object 'x' not found


Back to our example, if we want to obtain the solutions of the equation $x^2 + x - 1 = 0$, we can type the formula of the solution:

In [6]:
x1 <- (-b + sqrt(b ^ 2 - 4 * a * c)) / (2 * a)
x1
x2 <- (-b - sqrt(b ^ 2 - 4 * a * c)) / (2 * a)
x2
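
One way to check the two solutions is to substitute them back into the left-hand side of the equation, which should give zero up to floating-point rounding. The cell below is a minimal, self-contained sketch of that check; the names _lhs_1_ and _lhs_2_ are introduced here only for illustration.

```r
# Coefficients of x^2 + x - 1 = 0
a <- 1
b <- 1
c <- -1

# Both roots from the quadratic formula (note the parentheses around 2 * a)
x1 <- (-b + sqrt(b^2 - 4 * a * c)) / (2 * a)
x2 <- (-b - sqrt(b^2 - 4 * a * c)) / (2 * a)

# Substitute each root back into a*x^2 + b*x + c
lhs_1 <- a * x1^2 + b * x1 + c
lhs_2 <- a * x2^2 + b * x2 + c

# Compare with zero, allowing for rounding error
all.equal(lhs_1, 0)
all.equal(lhs_2, 0)
```

The parentheses around _2 * a_ matter: without them, R divides by 2 and then multiplies by _a_, which only gives the right answer when _a_ is 1.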

---
## Functions

R includes several predefined functions, such as __print__, __ls__ and __sqrt__. Even more functions can be added through packages. 

Parentheses matter when calling a function. To see the code of a function, we can type its name without the parentheses; to execute it, we have to write the parentheses, as shown below:

In [7]:
ls

In [8]:
ls()

Most functions require one or more arguments to be executed. For example, the __log__ function, which returns the natural log of a number:

In [9]:
log(8)
log(a)

In R, functions can be nested. In such cases, functions are evaluated from the inside out:

In [10]:
exp(1)
log(exp(1))
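
To make the inside-out evaluation order concrete, the sketch below computes the same value twice: once step by step, storing each intermediate result, and once as a single nested call. The variable names are hypothetical and chosen only for this illustration.

```r
# Step by step, storing each intermediate result
step_1 <- exp(2)        # the innermost call runs first
step_2 <- log(step_1)   # log undoes exp, giving back 2
step_3 <- sqrt(step_2)  # the outermost call runs last

# The same computation written as one nested call
nested <- sqrt(log(exp(2)))

# Both approaches should agree
all.equal(step_3, nested)
```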

To learn more about a function, we can call the function __help__ with the name of the function we want to know better as its argument. It returns a help file, which is like a user manual for the function:

In [11]:
help("log")

Another way of getting a help file is by typing __?__ followed by the name of the function: 

In [12]:
?log

To get a quick reminder of the arguments of a function, we can use the __args__ function:

In [13]:
args(log)

As we can see above, the default value of the base argument is _exp(1)_, that is, Euler's number _e_, which is why __log__ returns the natural log by default. We can change it by typing:

In [14]:
log(8, base=2)
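
Arguments can be matched either by position or by name. As a small sketch, the three calls below should all compute the base-2 logarithm of 8; the variable names are hypothetical and used only to keep the results apart.

```r
# Arguments matched by position: x first, base second
by_position <- log(8, 2)

# Arguments matched by name
by_name <- log(x = 8, base = 2)

# Named arguments can be given in any order
reordered <- log(base = 2, x = 8)

by_position
by_name
reordered
```

Naming the arguments is more verbose, but it makes the call easier to read and removes any dependence on argument order.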

It was said, earlier in this notebook, that functions need parentheses to be executed, but there are some exceptions. Among these, the most commonly used are the arithmetic and relational operators, such as the power operator:

In [15]:
2^3
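
Operators really are functions: wrapping an operator in backticks lets us call it with parentheses like any other function. The cell below is a minimal sketch of this, with variable names chosen only for illustration.

```r
# Usual infix form
infix <- 2^3

# The same operator called like an ordinary function
prefix <- `^`(2, 3)

# Relational operators work the same way
is_greater <- `>`(5, 2)

infix
prefix
is_greater
```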


Functions are not the only pre-built objects in R. There are also datasets that are included for users to practice and test out functions. We can see all the available datasets with the function __data__, and to access a dataset we can just type its name:

In [16]:
data()
CO2

Plant,Type,Treatment,conc,uptake
Qn1,Quebec,nonchilled,95,16.0
Qn1,Quebec,nonchilled,175,30.4
Qn1,Quebec,nonchilled,250,34.8
Qn1,Quebec,nonchilled,350,37.2
Qn1,Quebec,nonchilled,500,35.3
Qn1,Quebec,nonchilled,675,39.2
Qn1,Quebec,nonchilled,1000,39.7
Qn2,Quebec,nonchilled,95,13.6
Qn2,Quebec,nonchilled,175,27.3
Qn2,Quebec,nonchilled,250,37.1


There are other pre-built mathematical objects in R, such as the constant pi and infinity:

In [17]:
pi
Inf

When defining variables, it is important to follow two basic rules:

- they have to start with a letter; and
- they can't contain spaces.

It is also important to avoid using names that are already predefined in R, such as _pi_.
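
To see why reusing a predefined name is risky, the sketch below assigns a value to _pi_. The built-in constant is not destroyed, but our own object hides it until we remove that object with __rm__. The names _original_, _masked_ and _restored_ are hypothetical.

```r
original <- pi   # the built-in constant

pi <- 3          # creates our own object that masks the built-in
masked <- pi

rm(pi)           # remove our object; the built-in is visible again
restored <- pi

original
masked
restored
```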

A nice convention to follow is to use meaningful words that describe what is stored, stick to lowercase and use underscores:

In [18]:
solution_1 <- (-b + sqrt(b ^ 2 - 4 * a * c)) / (2 * a)
solution_1
solution_2 <- (-b - sqrt(b ^ 2 - 4 * a * c)) / (2 * a)
solution_2

When we write code, we usually save it in scripts. For example, if we wanted to solve another quadratic equation, such as $3x^2 + 2x - 1 = 0$, we could simply redefine _a, b_ and _c_. With a script, we don't have to rewrite everything again; we just change these three variables.

In [19]:
a <- 3
b <- 2
c <- -1

Finally, another way to make our code more readable is by including comments. To do so, we just need to start a line with _#_:

In [20]:
# Define variables
a <- 3
b <- 2
c <- -1

# Compute solutions
solution_1 <- (-b + sqrt(b ^ 2 - 4 * a * c)) / (2 * a)
solution_2 <- (-b - sqrt(b ^ 2 - 4 * a * c)) / (2 * a)

# Print solutions
print(solution_1)
print(solution_2)

[1] 0.3333333
[1] -1


---
## Data types

Variables in R can be of different types. For example, we need to distinguish numbers from character strings, and tables from simple lists of numbers. The function __class__ helps us determine the type of an object:

In [21]:
class(5)
class("hello")
class(ls)

The most common way of storing data in R is with _data frames_. Conceptually, we can think of data frames as tables. Rows represent observations and columns represent different variables. Data frames are particularly useful for datasets because we can combine different types into one single object.

To make this more concrete, in the code cell below we install and load the __dslabs__ package, load the __murders__ dataset and check its class:

In [22]:
install.packages("dslabs")
library(dslabs)
data(murders)
class(murders)

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


The function __str__ is useful when working with data frames because it shows us the structure of the object: 

In [23]:
str(murders)

'data.frame':	51 obs. of  5 variables:
 $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ abb       : chr  "AL" "AK" "AZ" "AR" ...
 $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
 $ population: num  4779736 710231 6392017 2915918 37253956 ...
 $ total     : num  135 19 232 93 1257 ...


From the output of the code cell above, we can see that the object _murders_ is a data frame that contains 51 observations and 5 variables. We can also see the names and types of these variables, as well as the first few values of each.

Another way of getting the names of the columns is by using the __names__ function:

In [24]:
names(murders)

We can see the first lines of the data frame using the function __head__:

In [25]:
head(murders)

state,abb,region,population,total
Alabama,AL,South,4779736,135
Alaska,AK,West,710231,19
Arizona,AZ,West,6392017,232
Arkansas,AR,South,2915918,93
California,CA,West,37253956,1257
Colorado,CO,West,5029196,65


To access the values in a column (a variable), we can use the __$__ symbol, called the accessor. In the example below, we access the values of population in the murders dataset:

In [26]:
murders$population

It is important to know that the order of the entries in the vector 'murders$population' preserves the order of the rows in the data frame. As we will see, this will later permit us to manipulate one variable based on the results of another. To discover the number of entries in a vector we can use the function __length__, and to discover the data type of the entries in that vector we can use the function __class__:

In [27]:
pop <- murders$population
length(pop)
class(pop)
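
Because the order of a column matches the order of the rows, a position found in one column can be used to look up another column. The sketch below shows this on a hypothetical toy data frame named _toy_, built by hand from the first three rows printed by head(murders), so it runs without the __dslabs__ package.

```r
# Toy data frame: first three rows of murders as shown by head()
toy <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  population = c(4779736, 710231, 6392017),
  total = c(135, 19, 232),
  stringsAsFactors = FALSE  # keep state as character in older R versions
)

pop <- toy$population
length(pop)   # one entry per row
class(pop)

# Position of the largest population, then the state in that same row
largest <- which.max(pop)
toy$state[largest]
```

Here __which.max__ returns the position of the largest entry, and using that position on _toy$state_ picks the state from the same row.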

Besides numbers, it is possible to store character strings in R too. Because variable names are also made of characters, we have to use quotes to distinguish character strings from variable names.

In [28]:
a <- 1
a
"a"

When we used str(murders) above, we saw that the column _state_ contains characters:

In [29]:
class(murders$state)

Another type of vector is the _logical vector_. We don't have any example of it in this particular data frame, but it is worth mentioning. Its values must be __TRUE__ or __FALSE__:

In [30]:
z <- 3 == 2
z
class(z)
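
Logical values usually come from relational operators, and these operators also work on whole vectors. As a hedged sketch, the cell below compares three population values (the first three shown by str(murders)) against a threshold of one million, which is chosen here only for illustration, as are the variable names.

```r
# First three population values shown by str(murders)
populations <- c(4779736, 710231, 6392017)

# One TRUE/FALSE per entry
is_large <- populations > 1000000
is_large
class(is_large)

# TRUE counts as 1, so sum() gives the number of TRUE entries
n_large <- sum(is_large)
n_large
```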

There's one more data type we need to cover: the _factor_. In the murders dataset, the column _region_ might look like a character vector, but the output of str(murders) shows that it is actually a factor.

Factors are useful for storing what is called categorical data, such as the regions of the US. We can see the four levels of a factor using the function __levels__:

In [31]:
levels(murders$region)

Saving categorical data as factors can be more memory efficient: in the background, R stores each entry as an integer code that refers to one of the levels (integers generally take less memory than character strings).
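
The integer storage can be inspected directly. The sketch below builds a small hypothetical factor by hand (it is not the _region_ column itself, whose levels are in a different order) and shows its levels alongside the integer codes R keeps in the background.

```r
# A small hand-made factor; levels are sorted alphabetically by default
regions <- factor(c("South", "West", "West", "South", "Northeast"))

levels(regions)      # "Northeast" "South" "West"
as.integer(regions)  # 2 3 3 2 1: position of each entry in the levels

class(regions)       # how R presents the object
typeof(regions)      # how R stores it internally
```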

However, factors are also a source of confusion, as they can easily be mistaken for characters. In general, it is recommended to avoid factors as much as possible, although they are sometimes necessary to fit statistical models that depend on categorical data.
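
A typical example of this confusion is a factor whose labels look like numbers. In the hypothetical sketch below, converting the factor directly to numeric returns the integer codes rather than the values we see printed; converting to character first gives the expected numbers.

```r
# A factor whose labels happen to look like numbers
values <- factor(c("10", "20", "5"))

# Direct conversion returns the integer codes, not the labels
codes <- as.numeric(values)
codes                # 1 2 3

# Converting the labels to character first recovers the numbers
numbers <- as.numeric(as.character(values))
numbers              # 10 20 5
```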