In [1]:
library(digest)

# ECON 325: Introduction to R

### Authors

- Colby Chambers (Reviewer)
- Anneke Dresselhuis (Reviewer)
- Oliver Xu (Developer)
- Jonathan Graves (Reviewer)

### Prerequisites: 
* Introduction to Jupyter

### Learning Objectives

- Understand variables, functions and objects in R
- Import and load data into Jupyter Notebook
- Access and perform manipulations of this data

In [2]:
# Run this cell

source("intro_to_r_tests.r")

# Part 1: Basics of Using R

In this notebook, we will be introducing **R**, a programming language that is particularly well-suited for statistics, econometrics, and data science.  If you are familiar with other programming languages, such as Python, this will likely be very familiar - if this is your first time, don't be intimidated!  Try to play around with the examples and exercises as you work through this notebook; it's easiest to learn R (or any programming language) by playing around with it.

## Basic Data Types

To begin, it's important to get a good grasp of the different **data types** in R and how to use them.  Whenever we work with R, we will be manipulating different kinds of information, which is referred to as "data".  Data comes in many different forms, which define how we can use it in calculations or visualizations - these are called _types_ in R.

R has 6 basic data types. Data types are used to store information about a variable or object in R:

1. **Character**: data in text format, like "word" or "abc"
2. **Numeric** (real or decimal): data in real number format, like 6, or 18.8 (referred to as **Double** in R)
3. **Integer**:  data in a whole number (integer) format, like 2L (the L tells R to store this as an integer)
4. **Logical**: truth values, like TRUE, FALSE
5. **Complex**: data in complex (i.e. imaginary) format, like 1+6i (where $i$ is the $\sqrt{-1}$)
6. **Raw**: an unusual data type which we will not cover in this section

If we are ever wondering what kind of type an object in R has, or what its properties are, we can use the following two functions, which allow us to examine the data type and elements contained within an object:

* `typeof()`: this function returns a character string that corresponds to the data type of an object 
* `str()`: this function displays a compact internal structure of an R object 

We will see some examples of these in just a moment.
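As a quick preview, the sketch below applies `typeof()` to one value of each basic type listed above. The values are arbitrary and chosen only for illustration.

```r
# One example value for each basic data type (values are arbitrary)
typeof("abc")            # "character"
typeof(18.8)             # "double"
typeof(2L)               # "integer"
typeof(TRUE)             # "logical"
typeof(1 + 6i)           # "complex"
typeof(charToRaw("a"))   # "raw"

# A number typed without the L suffix is stored as a double, not an integer
typeof(6)                # "double"

# str() shows the type together with the value
str(2L)
```

Note that `6` and `6L` look alike when printed but have different types; this only matters occasionally, for example when a function insists on integer input.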

## Data Structures

Basic data is fine, but we often need to store data in more complex forms.  **Data structures** in R are tools for holding multiple values. They are more complicated than the basic data types, but some of them are very important and are worth discussing here.

* **Vectors**: a vector of values, like $(1,3,5,7)$
* **Matrices**: a matrix of values, like $[1,2; 3,4]$ (usually displayed as a rectangular grid of rows and columns)
* **Lists**: a list of elements, like $($pet = "cat", "dog", "mouse"$)$, with named properties
* **Dataframe**: a collection of vectors or lists, organized into rows and columns according to observations.

Note that vectors don't need to be numeric!  You can also use some useful built-in functions to create data structures (we don't have to create our own functions to do so).

* `c()`: this function combines values into a vector
* `matrix()`: this function creates a matrix from a given set of values
* `list()`: this function creates a list from a given set of values
* `data.frame()`: this function creates a data frame from a given set of lists or vectors

Okay, enough background - let's see this in action!

### Working with Vectors

Vectors are important, and we can work with them by creating them from values or other elements, using the `c()` function:

In [None]:
# generates a vector containing values
z <- c(1, 2, 3)

# generates a vector containing characters
countries <- c("Canada", "Japan", "United Kingdom")

We can also **access** the elements of the vector.  Since a vector is made of basic data, we can get those elements using the `[ ]` index notation.  This is very similar to how in mathematical notation we refer to elements of a vector.

> ***Note***: if you're familiar with other programming languages, it's important to note that R is 1-indexed.  So, the first element of a vector is 1, not 0.  Keep this in mind!

In [None]:
# If we want to access specific parts of the vector:

# the 2nd component of z
z[2]

# the 2nd component of countries
countries[2]

As mentioned above, we can use the `typeof()` and `str()` functions to glimpse the kind of data stored in our objects. 
Run the cell below to see how this works:

In [None]:
# view the data type of countries
typeof(countries)

# view the data structure of countries
str(countries)

# view the data type of z
typeof(z)

# view the data structure of z
str(z)

The output of `str(countries)` begins by indicating that the contained data is of a character (`chr`) type. The `[1:3]` shows that the vector is one-dimensional, with elements indexed from 1 to 3 (the 3 countries). The final part displays the values themselves.

## Self Test: Vectors

In this exercise:

1. Generate an object `vect` which is a vector containing the numbers from 10 to 15. Hint: you can use the `seq()` function in R to generate a sequence.
2. Extract the 4th element of `vect` and store it in the object `answer1`

In [3]:
vect <- c(seq(from = 10, to = 15))

answer1 <- vect[4]

test_1()

Test passed 🌈
[1] "Success!"


### Working with Matrices

Just like vectors, we can also create matrices; you can think of them as organized collections of rows (or columns), which are vectors.  They're a little bit more complicated to create manually, since you need to use a more complex function.

The simplest way is just to provide a vector of all the values, then tell R how the matrix is organized; R will then fill in the values:

In [None]:
# generates a 2 x 3 matrix (2 rows, 3 columns)
m <- matrix(c(2,3,6,7,7,3), nrow=2,ncol=3)

print(m)

Take note of the order in which the values are filled in; it might be unexpected!  By default, R fills a matrix column by column, not row by row.
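To make the fill order concrete, here is a small sketch that builds the same values twice: once with the default column-wise fill, and once with the `byrow = TRUE` argument of `matrix()`. The object names `by_col` and `by_row` are made up for this illustration.

```r
values <- c(2, 3, 6, 7, 7, 3)

# Default: values fill the matrix column by column
by_col <- matrix(values, nrow = 2, ncol = 3)
print(by_col)
#      [,1] [,2] [,3]
# [1,]    2    6    7
# [2,]    3    7    3

# byrow = TRUE fills the matrix row by row instead
by_row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)
print(by_row)
#      [,1] [,2] [,3]
# [1,]    2    3    6
# [2,]    7    7    3
```

Both matrices hold the same six values; only their arrangement differs, so it is worth printing a matrix after creating it to confirm the layout.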

Just like with vectors, we can also access parts of the matrix.  If you look at the cell output above, you will see some notation like `[1,]` or `[,2]`.  These are the _rows_ and _columns_ of the matrix.  We can refer to them using this notation.  We can also refer to elements using `[1,2]`.  Again, this is very similar to the mathematical notation for matrices.

In [None]:
# If we want to access specific parts of the matrix:

# 2nd column of matrix
m[,2]     

# 1st row of matrix
m[1,]  

# element in row 1, column 2
m[1,2]


As with vectors, we can also observe and inspect the data structures of matrices using the helper functions `typeof()` and `str()` introduced above.

In [None]:
# what type is m?

typeof(m)

# glimpse data structure of m
str(m)

The output of `str(m)` begins by displaying that the data in the matrix is of a numeric (`num`) type. The `[1:2, 1:3]` shows the structure of the rows and columns: 2 rows and 3 columns.  The final part displays the values in the matrix.

## Self Test: Matrices

In this exercise:

1. Create an object named `mat` which is a matrix with 2 rows and 2 columns. The first column will take on values 1,2, while the second column will take values 3,4.
2. Extract the first row, second column from `mat` and store it in the object `answer2`

In [4]:
mat <- matrix(c(1,2,3,4), nrow=2,ncol=2)
answer2 <- mat[1,2]

test_2()

Test passed 😀
[1] "Success!"


### Working with Lists

Lists are a little bit more complex, because they can store many different data types and objects, each of which can be given _names_ which are specific ways to refer to these objects.  Names can be any useful descriptive term for an element of the list.  You can think of lists like flexible vectors with names.

In [None]:
# generates a list with 3 components named "text" "a_vector" and "a_matrix"
my_list <- list(text="test", a_vector = z, a_matrix = m) 

We can access elements of the list using the `[ ]` or `[[ ]]` operations.  There is a difference:

* `[ ]` accesses the _elements of the list_, returning a smaller list that still contains the name and the object
* `[[ ]]` accesses the _object_ directly, without the list wrapper

We usually want to use `[[ ]]` when working with data stored in lists.  One very nice feature is that you can refer to elements of a list by number (like a vector) or by their name.

In [None]:
# If we want to access specific parts of the list:

# 1st component in list
my_list[[1]] 

# 1st component in list by name (text)
my_list[["text"]]

# 1st part of the list (note the brackets)
my_list[1] 

# glimpse data type of my_list
typeof(my_list)

There is one final way to access elements of a list by name: using the `$` or **access** operator.  This works basically like `[[name]]` but is more transparent when writing code.  You put the object you want to access, followed by the operator, followed by the property:

In [None]:
# get the named property "text"
my_list$text

# get the named property "a_matrix"
my_list$a_matrix

You will notice that this _only_ works for named objects - which is particularly convenient for dataframes, which we will discuss next.
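The sketch below puts the three access styles side by side on a small list. The list `pets` and its component names are invented for this illustration.

```r
# A small list with two named components (names are illustrative)
pets <- list(species = "cat", ages = c(2, 5, 7))

# [ ] keeps the list wrapper: the result is itself a list of length 1
typeof(pets["ages"])     # "list"

# [[ ]] and $ return the stored object itself
typeof(pets[["ages"]])   # "double"
identical(pets[["ages"]], pets$ages)   # TRUE

# So calculations work on the result of [[ ]] or $
mean(pets$ages)
```

Because `pets["ages"]` is still a list, passing it to `mean()` would not give a number; this is the usual reason to prefer `[[ ]]` or `$` when working with the stored data.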

## Self Test: Lists

In this exercise, we will need to:

1. Create an object named `list_a`, which is a list with 2 components, a string "Hello World", and a vector with values 1 through 10.
2. Extract the second component, and store it in the object `answer3`

In [5]:
list_a <- list(World = "Hello World", Range = c(seq(from = 1,to = 10)))

answer3 <- list_a[[2]]

test_3()

Test passed 🎊
[1] "Success!"


### Working with Dataframes

Dataframes are the most complex object you will work with in this course, but also the most important.  They represent data - like the kind of data we would use in econometrics.  In this course, we will primarily focus on _tidy_ data, which refers to data in which the columns represent variables, and the rows represent observations.  In terms of R, you can think of data-frames as a combination of a matrix and a list.

We can access columns (variables) using their names or their position:

In [None]:
# generates a dataframe with 2 columns and 3 rows
df <- data.frame(ID=c(1:3),
                 Country=countries)

# If we want to access specific parts of the dataframe:

# 2nd column in dataframe
df[2] 

df$Country

# glimpse compact data structure of df
str(df)

Notice that the `str(df)` command shows us what the names of the columns are in this dataset, and how we can access them.
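As with lists, the bracket style changes what comes back from a dataframe. The sketch below rebuilds the small example under a different name, `df_demo`, so that it is self-contained; `stringsAsFactors = FALSE` is set explicitly as an assumption so that the text column stays a character vector in older versions of R as well.

```r
# Self-contained copy of the small dataframe from the example
countries <- c("Canada", "Japan", "United Kingdom")
df_demo <- data.frame(ID = 1:3, Country = countries,
                      stringsAsFactors = FALSE)

# Single brackets return a dataframe with one column
class(df_demo[2])        # "data.frame"

# Double brackets and $ return the column itself as a vector
class(df_demo[[2]])      # "character"
identical(df_demo[[2]], df_demo$Country)   # TRUE

# Row and column indices work as they do for a matrix
df_demo[3, "Country"]    # "United Kingdom"
```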

## Self Test: Dataframes

In this exercise:

1. Create an object `df` which is a dataframe with two columns and two rows. The first column `var1` will take on values c(1,2). The second column `var2` will take on values c("A", "B"). 
2. Extract the column `var1`, take the mean, rounded to the nearest 2 decimal places, and store it in the object `answer4`

In [6]:
df <- data.frame(var1=c(1,2),var2=c("A","B"))
answer4 <- round(mean(df$var1), digits = 2)


test_4()

Test passed 🥳
[1] "Success!"


## Objects and Variables

At this point, you are familiar with some of the different types of data in R and how they work.  However, let's understand how we can work with them in more detail by writing R code.   A **variable** or **object** is a name assigned to a memory location in the R workspace (working memory). For now we can use the terms variable and object interchangeably. An object will always have an associated type, determined by the information assigned to it. Clear and concise object assignment is essential for **reproducible data analysis**, as mentioned in the module Intro to Jupyter.

When it comes to code, we can assign information (stored in a specific data type) to variables and objects using the **assignment operator** `<-`. Using the assignment operator, the information on the right-hand side is assigned to the variable/object on the left-hand side; we've seen this before, in some of the examples earlier.

In the example below, `"Hello"` has been assigned to the object `var_1`. `"Hello"` will be stored in the R workspace as an object named `var_1`.

> **Important Note**: R is case sensitive. When referring to an object, it must _exactly_ match the assignment.  `Var_1` is not the same as `var_1` or `var1`

In [None]:
var_1 <- "Hello"

var_1

typeof(var_1)

You can create variables of many different types, including all of the basic and advanced types we discussed above.

In [None]:
var_2 <- 34.5 #numeric/double
var_3 <- 6L #integer
var_4 <- TRUE #logical/boolean
var_5 <- 1 + 3i #complex


## Self Test: Objects and Variables

In this exercise:

1. Create an object `answer5` which has been assigned the string "Hello World"

In [7]:
answer5 <- "Hello World"


test_5()

Test passed 🥇
[1] "Success!"


## Operations

In R, we can also perform **operations** on objects; the type of an object defines what operations are valid. All of the basic mathematical and logical operations you are familiar with are example of these, but there are many more.  For example:

In [None]:
a <- 4 # creates an object named "a" assigned to the value: 4
b <- 6 # creates an object named "b" assigned to the value: 6
c <- a + b # creates an object "c" assigned to the value (a = 4) + (b = 6)

> Try and think about what value c holds!

We can view the assigned value of `c` in two different ways:
1. By printing `a + b`\
OR
1. By printing `c`

Run the code cell below to see for yourself!

In [None]:
a + b
c

It is also possible to change the value of our objects. In the example below, the object `b` has been reassigned the value 5.

In [None]:
b <- 5 

R will now store the updated value of 5 in the object `b`. This overrides the original assignment of 6 to `b`. The ability to change the value assigned to an object is a key benefit of using variables in R. We can simply reassign the value to a variable without having to change that value everywhere in our code. This will be quite useful when we want to do things such as modify a column in a dataset.
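One detail worth seeing in code: reassigning `b` does not automatically update objects that were computed from it earlier. In the sketch below, the name `total` is used in place of `c` purely for illustration.

```r
a <- 4
b <- 6
total <- a + b   # total holds 10

b <- 5           # reassign b

total            # still 10: total stored a value, not the formula a + b
total <- a + b   # recompute to pick up the new value of b
total            # now 9
```

In a notebook, this means that after changing an object you need to re-run the cells that depend on it.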

> ***Tip:***  Remember to use a unique object name that hasn't been used before in order to avoid unplanned object reassignments when creating a new object.  Descriptive names are better!

## Self Test: Operations

In this exercise:

1. create an object `u` which is equal to 1
2. create an object `y` which is equal to 7
3. create an object `w` which is equal to 10.
4. create an object `answer6` which is equal to the sum of `u` and `y`, divided by `w`

In [8]:
u <- 1
y <- 7
w <- 10

answer6 <- (u + y)/w


test_6()

Test passed 🌈
[1] "Success!"


## Comments

While developing our code, we do not always have to use markdown cells to document our process. We can also write notes in code cells using something called a **comment**. A comment simply allows us to write lines in our cell which will not run when we run the cell itself. By simply typing the `#` sign, anything written directly after this sign and on the same line will not run; it is a comment. To comment out multiple lines of code, simply include the `#` sign at the start of each line. 

> In general, the purpose of comments is to make the source code easier for readers to understand. 


It is important to comment on your code for three main reasons:

1. **Keep track** of your actions and thought process: Commenting is a great way to help you stay organized. Code comments provide an ordered process for everyone to follow. In case we need to debug our code, we can easily track which step is problematic and come back to that particular line of code.

2. **Help readers understand** why you're coding this way: When writing code, certain methods are pretty obvious. For example, `a + b` is a straightforward computation, even to someone who has no programming experience. However, you are going to develop your own project eventually. In your project, although what `a + b` does is obvious, it may be unclear what this operation's purpose is. Your readers or other developers may ask: why did you use addition instead of multiplication or division? With comments, you can explain why this particular method was used for this particular code block and how it relates to other code blocks.

3. It **saves everyone's time** in the future, including your own: It's far easier than you might expect to forget what a piece of code does, or is supposed to do.  Keeping good comments ensures that your code remains comprehensible.

> ***Tip***: an old woodworker's tip is when you take things apart you should label them so that a stranger could put them back together.  The same advice applies to comments and coding: write code so that a stranger could figure out what it is supposed to do.

Generally, it is a good idea to add comments to your code. However, if you find yourself needing to explain a block of code in great detail in terms of its purpose in the project, it is preferable to use a markdown cell instead. Comments are best reserved for the reasons above.

## More on Operators

Earlier, we discussed operations and used the example of `+` to add `a` and `b`. `+` is an R arithmetic **operator**: a symbol that tells R to perform a specific operation. We can use different R operators with variables. R has 4 types of operators:

1. **Arithmetic operators**: used to carry out mathematical operations. Ex. `*` for multiplication, `/` for division, `^` for exponent etc.
1. **Assignment operators**: used to assign values to variables. Ex. `<-` 
1. **Relational operators**: used to compare between values. Ex. `>` for greater than, `==` for equal to, `!=` for not equal to etc.
1. **Logical operators**: used to carry out Boolean operations. Ex. `!` for Logical NOT, `&` for Logical AND etc.

We won't cover all of these right now, but you can look them up online; for now, keep an eye out for them when they occur.
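As a brief sketch, the cell below uses one or two operators from each of the four groups. The values of `x` and `y` are arbitrary.

```r
x <- 7      # assignment operator
y <- 2

# Arithmetic operators
x * y       # 14
x / y       # 3.5
x ^ y       # 49

# Relational operators return logical values
x > y       # TRUE
x == y      # FALSE
x != y      # TRUE

# Logical operators combine or negate logical values
(x > 5) & (y > 5)   # FALSE
!(x == y)           # TRUE
```

Note the difference between `<-`, which assigns a value, and `==`, which tests whether two values are equal.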

## Functions

These simple operations are great to start with, but what if we want to do operations on different values of X and Y over and over and don’t want to constantly rewrite this code? This is where **functions** come in. Functions allow us to carry out specific tasks. We simply pass in a parameter or parameters to the function. Code is then executed in the function body based on these parameters, and output may be returned.

In [None]:
# Functionname <- function(arguments)
#  {code operating on the arguments
#   }

This structure says that we start with a name for our function (`Functionname`, here) and we use the assignment operator similarly to when we assign values to variables. We then pass **arguments or parameters** to our function (which can be numeric, characters, vectors, collections such as lists, etc.); think of them as the _inputs_ to the function. 

Finally, within the curly brackets we write our code needed to accomplish our desired task. Once we have done this, we can call this function anywhere in our code (after having run the cell defining the function!) and evaluate it based on specific parameter values. 

An example is shown below; can you figure out what this function does?

In [None]:
my_function <- function(x, y)
 {x = x + y
 2 * x
}

The parameters input to functions can be given **defaults**. Defaults are specific values for parameters that have been chosen and defined within the parentheses of the function definition. For example, we can define `y = 3` as a default in our `my_function`. When we call our function, we then do not have to specify an input for `y` unless we want to.

In [None]:
my_function <- function(x, y = 3)
 {x = x + y
 2 * x}

my_function(2)

However, if we want to override this default, we can simply call the function with a new input for `y`. This is done below for `y=4`, allowing us to execute our code as though our default was actually `y=4`.

In [None]:
# the default for y is still 3
my_function <- function(x, y = 3)
 {x = x + y
  2 * x}

# passing y = 4 in the call overrides the default
my_function(2, 4)

Finally, note that we can **nest** functions within functions, meaning we can call functions inside of other functions - creating very complex arrangements. Just be sure that these inner functions have themselves already been defined. 

In [None]:
my_function_1 <- function(x, y)
 {x = x + y + 2
  2 * x}

my_function_2 <- function(x, y)
 {x = x + y - my_function_1(x, y)
  2 * x}

my_function_2(2, 3)

Luckily, we usually don't have to define our own functions, since most of the useful functions we need are built in and already come with R. They do not require creation; they already exist for us, although some may require importing specific packages. We can always use the help `?` feature in R to learn more about a built-in function if we're unsure. For example, `?max` gives us more information about the `max()` function.
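Here is a short sketch of calling a few built-in functions. The vector `scores` is made up for illustration.

```r
scores <- c(72, 88.456, 95, 61)   # illustrative values

max(scores)                 # 95
mean(scores)
round(88.456, digits = 1)   # 88.5

# Arguments can be supplied by name, in which case their order does not matter
seq(from = 10, to = 15)     # 10 11 12 13 14 15
seq(to = 15, from = 10)     # same result
```

Naming arguments, as in `digits = 1`, makes a call easier to read than relying on position alone.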

For more information about how you should read and use different functions, please refer to the [Function Cheat Sheet](https://cran.r-project.org/doc/contrib/Short-refcard.pdf). 

## Self Test: Functions

In this exercise:

1. Create a function `divide` which takes in two arguments, `x` and `y`. The function should return `x` divided by `y`. 
2. Store the solution to divide(5,3), rounded to the nearest 2 decimal places, in the object `answer7`:

In [9]:
divide <- function(x,y) {
    sol <- x/y
    return(sol) 
    }

# Your code goes here

answer7 <- round(divide(5,3), digits = 2)


test_7()

Test passed 😀
[1] "Success!"


## Oops!  Dealing with Errors 
Sometimes in our analysis we can run into errors in our code; it happens to everyone, and is not a reason to panic. Understanding the nature of the error we are confronted with can be a helpful first step to finding a solution. There are two common types of errors:

* **Syntax errors**: This is the most common error type. They result from invalid code statements/structures that R doesn’t understand. Suppose R speaks English: asking it for help in German or broken English certainly would not work! Here are some examples of common syntax errors: the associated package is not loaded, a command is misspelled (remember that R is case-sensitive), parentheses are unmatched, etc. How we handle syntax errors is case-by-case: we can usually solve syntax errors by reading the error message and googling it.

* **Semantic errors**: They result from valid code that successfully executes but produces unintended outcomes. Again, let us suppose R speaks English. You asked it to hand you an apple in English and R successfully understood, but then it handed you a banana! This is not okay! How we handle semantic errors is also case-by-case: because the code runs, there is usually no error message to read, so we find these errors by checking our results against what we expect and tracing back through the code step by step.
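The two kinds of error can be sketched as follows. The snippet is illustrative only: `parse()` and `tryCatch()` are used so that the syntax error can be shown without stopping the cell, and the object names are invented.

```r
# Syntax error: the closing parenthesis is missing, so R cannot parse the code.
# parse() is used here so the error can be caught instead of stopping the cell.
bad_code <- "mean(c(1, 2, 3)"
syntax_result <- tryCatch(
  parse(text = bad_code),
  error = function(e) "syntax error caught"
)
syntax_result

# Semantic error: the code runs without complaint, but the logic is wrong.
# The intent is the average of 2, 4 and 6, so the divisor should be 3.
values <- c(2, 4, 6)
wrong_average <- sum(values) / 2    # runs, returns 6
right_average <- mean(values)       # 4
wrong_average == right_average      # FALSE
```

R raises no message for the second mistake, which is why checking results against a value you can work out by hand is a useful habit.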

Now that we have all of these terms and tools at our disposal, we can begin to load in data and operate on it using what we’ve learned.

## Wrapping Up

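In this notebook, we have learned about R's basic data types and data structures, how to create objects and variables, and how to use operations, comments, and functions. We have also seen the most common kinds of errors.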

# Part 2: Working with Data

In general, we need to prepare two fundamental tools before working with any datasets in R. 

1. R development environment (R, RStudio or R Notebook etc.)
2. A collection of R packages 

We already know how to work in R Notebook. Now we will briefly walk through the second point: understanding R packages and how we can use them. 

**R packages** are simply collections of reusable R functions, data, and compiled code, along with the associated documentation which describes how we can use them. Different packages add different functionalities. Even now, many R users are actively contributing new packages. When we open our R notebook, it already contains the base package that allows R notebook to operate. The **R library** is where all the packages are stored in our working environment.

Depending on what our programming goal is, we will need to load different packages. Suppose we need a specific functionality (say data manipulation) and we want to load an R package with such functionality (*tidyverse*). We should first download that R package from CRAN (the Comprehensive R Archive Network, R's main package repository), after which it will be stored in our library. To download the *tidyverse* package, we type this command: `install.packages("tidyverse")`.
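Installing a package only needs to happen once, whereas `library()` must be run in every new session. One possible pattern is sketched below; `install_if_missing()` is a hypothetical helper written for this illustration, not a function that comes with R.

```r
# Hypothetical helper: install a package only when it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call installs nothing and returns TRUE
is_available <- install_if_missing("stats")
is_available
```

Under the same assumption, calling `install_if_missing("tidyverse")` before `library(tidyverse)` would avoid re-downloading the package each time the notebook is run.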

In [None]:
# Load relevant R packages for the following analysis

# loading the tidyverse makes the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages available to use
library(tidyverse) 

# the haven package allows you to load foreign data formats (SAS, SPSS and Stata) into R
library(haven)

While running our R code, sometimes we may encounter errors or warnings. An error message indicates our code is not being used in the way it is designed to be used, while a warning message signals a potential problem that came up while our code was executing. This is totally fine and we can address/fix both!

Errors are the worst because they mean that your code cannot function properly and it must be fixed. On the other hand, warnings are much better than errors. Although warnings indicate that there is something wrong with your code, this is not necessarily a bad thing. Typically, you don't have to "fix" warning messages when you are reasonably certain you will already receive the desired output from your code. In that case, you may choose to ignore warning messages.

For example, R and its packages sometimes rename their functions. In a newer version, an older function may have a new name. The older function can still work, but it may generate a warning message. If you do not want to use the new function, then ignoring that warning message is fine.

In contrast, take a look at the example below. This is a warning message you should not ignore because the code is not producing your desired output. In mathematics, the square root of -2 is an imaginary number rather than a real number. When given a negative real number, the `sqrt()` function returns `NaN` ("not a number") and issues a warning, because the result is not a real number. The code still runs, which is why this is a warning and not an error. How we handle warning messages is case-by-case. You can usually solve the problem by googling the solution online.
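As a sketch of one way to respond to this particular warning, the snippet below first confirms that the result is `NaN`, then converts the input with `as.complex()` so that `sqrt()` returns the imaginary root. `suppressWarnings()` is used only to keep this demonstration quiet.

```r
# sqrt() on a negative real number returns NaN and raises a warning
result <- suppressWarnings(sqrt(-2))   # warning silenced only for this demo
is.nan(result)                          # TRUE

# Telling R that the input is complex gives the imaginary root instead
complex_root <- sqrt(as.complex(-2))
complex_root                            # 0+1.414214i
```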

In [None]:
# Take the square root of -2 using the built-in function sqrt()
# It generates a warning message because the result is not a real number
sqrt(-2)

#### Importing the Dataset 

Data comes in many different formats. For example, data files with a `.dta` suffix are data from the statistical software Stata, data files with a `.csv` suffix are data in comma-separated values format, and data files with an `.Rda` suffix are data saved from R.

When it comes to loading data, R uses different function calls for different data formats. Below are the three most common data formats used in economics.

1. `read.csv()`: reads a CSV file into memory
2. `read_dta()`: reads a Stata DTA file into memory
3. `load()`: reads an R Rda file into memory                

Therefore, it is critical to call the correct R function based on the suffix of our data file. Suppose we want to load the datafile `01_census2016` in three different formats: `01_census2016.csv`, `01_census2016.dta`, and `01_census2016.Rda` respectively. Note that the file name must be given in quotation marks. Our code should be:

1. `read.csv("01_census2016.csv")`
2. `read_dta("01_census2016.dta")`
3. `load("01_census2016.Rda")`
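The loading functions behave slightly differently, which the sketch below illustrates using only base R. It writes a tiny made-up dataset to temporary files and reads it back; the object and file names are invented, and `read_dta()` is left out so that the example runs without the `haven` package.

```r
# A tiny made-up dataset, written to temporary files for illustration
toy <- data.frame(id = 1:3, wages = c(50000, 62000, 48000))
csv_path <- tempfile(fileext = ".csv")
rda_path <- tempfile(fileext = ".Rda")

write.csv(toy, csv_path, row.names = FALSE)
save(toy, file = rda_path)

# read.csv() returns the data, so we assign the result to a name
from_csv <- read.csv(csv_path)

# load() restores objects under their saved names and returns those names
rm(toy)
loaded_names <- load(rda_path)
loaded_names        # "toy"
dim(toy)            # 3 2
```

In short, `read.csv()` (like `read_dta()`) needs an assignment with `<-`, while `load()` puts the saved objects straight back into the workspace.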

Now let's load in our data file. We will load in a dataset which captures information from the 2016 Canadian census administered by Statistics Canada.

In [None]:
# read the Stata DTA file "01_census2016.dta" into memory
# assign it to the R object "census_data"
census_data <- read_dta("01_census2016.dta")

The dataset `census_data` is in **tidy** format. This means that the data is rectangular, or more specifically, that each row represents an observation and each column represents a variable. In other words, each variable has its own column, each observation has its own row and each value has its own cell. Our tidy dataset is also easier to work with inside the package `tidyverse` we are currently using. All packages in the `tidyverse` (e.g. `dplyr`) are designed to work specifically with tidy datasets.

#### Exploring the Dataset 

We have now imported our dataset. It is a "new" dataset to us. To better understand our dataset as a whole, the natural next step is to view and explore it. We always do this immediately after loading in our data to make sure there are no weird features or problems with it. Here are some basic R commands to investigate data:
1. `head(...)`: View the first 6 rows of a dataset
2. `tail(...)`: View the last 6 rows of a dataset
3. `dim(...)`: Find the dimensions of a dataset                 
4. `str(...)`: Understand the structure of a dataset     
5. `colnames(...)`: View all the variable names of a dataset

In [None]:
# See the first 6 rows
head(census_data)

In [None]:
# See the last 6 rows
tail(census_data)

In [None]:
# Find the dimensions
dim(census_data)

Here we can see that the dimensions of our dataset are 391938 by 16, meaning that we have 391938 observations and 16 variables in total. Whoa! That's a lot of data!
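The same exploration commands work on any dataframe. As a sketch that runs without the census file, the cell below uses `mtcars`, a small dataset that ships with R.

```r
# mtcars is a small dataset that ships with R
dim(mtcars)        # 32 11
nrow(mtcars)       # 32 observations
ncol(mtcars)       # 11 variables

# head() and tail() accept n to change how many rows are shown
head(mtcars, n = 3)
colnames(mtcars)[1:3]   # "mpg" "cyl" "disp"
```

`dim()` always reports rows first and columns second, matching the `[row, column]` order used for indexing.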

In [None]:
# Understand the structure
str(census_data)

In [None]:
# Find the variable names
colnames(census_data)

#### Accessing the Variables of the Dataset

There are many variables in this dataset! Where should we begin? **Suppose we are interested in analyzing the average wage gap between men and women.**  

>We note that gender and sex are used synonymously in this dataset in a binary class relation, although in real life gender is best understood as existing on a spectrum.\
\
Gender exists separately from the sex assigned at birth and there is diversity in how
different people experience, understand, and express gender, and how gender is
institutionalized within a society (1). In econometrics, it's important for us to consider how the decisions we make to encode and classify information can have potentially harmful or exclusionary outcomes.
\
\
(1): _Ahluwalia, A., Bean, C., Chachram, M., Chan, R., Cheang, R., Cheng, S., Chindea, V., Conrad-Kilgallen, K., Cooey-Hurtado, L., Correa, K., Dresselhuis, A., James, E., Kim, C., Knight, K., Mozolevych, N., Poissant, J., Sarvini, J., Tomar, B., Traboulay, A. (2022). Community-Based Research & Data Justice Resource Guide. Gender + in Research Collective at ORICE UBC, Vancouver, BC._ 

As we begin our analysis, we can see that there are two variables important to our analysis: `sex` and `wages`.

* `sex`: the sex of the individual
* `wages`: the annual wage of the individual

To access `sex` and `wages` in the dataset, we can use the access operator `$`. For example, `x$y` accesses the variable `y` from the dataset `x`:

* `census_data$sex`: access the variable `sex` from the dataset `census_data`
* `census_data$wages`: access the variable `wages` from the dataset `census_data`
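A minimal sketch of the same access pattern on a made-up dataframe is shown below. `toy_census` and its values are invented stand-ins and are not taken from the census file.

```r
# Made-up data standing in for the census file (values are illustrative only)
toy_census <- data.frame(
  sex = c("male", "female", "female"),
  wages = c(52000, 61000, 47000)
)

# dataset$variable returns that column as a vector
toy_wages <- toy_census$wages
toy_wages              # 52000 61000 47000
length(toy_wages)      # 3

# The same column can be reached with [[ ]] and the column name
identical(toy_wages, toy_census[["wages"]])   # TRUE
```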

In [None]:
# access the "sex" variable and assign it to the R object "sex"
sex <- census_data$sex

# access the "wages" variable and assign it to the R object "wage"
wage <- census_data$wages

Now we have extracted two variables from the dataset. The natural next step is to filter out the missing values or NAs before we do further analysis. “NA” stands for “Not Available”. It is quite common to see datasets with NA observations, as there's no perfect dataset in real life. For example, an unemployed individual may have nothing to report for their annual wages, thus leaving the data cell unavailable or marked as NA. We can first use the R function `is.na()` to check whether there are any NAs, and then use the function `drop_na()` to get rid of these NA observations.
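To see what these checks do on a small scale, here is a sketch on made-up data. It uses base R row filtering in place of `drop_na()` so that it runs without loading the tidyverse; the object names and values are invented.

```r
# Made-up wages with one missing value
toy <- data.frame(
  sex = c("male", "female", "female"),
  wages = c(52000, NA, 47000)
)

is.na(toy$wages)         # FALSE  TRUE FALSE
any(is.na(toy$wages))    # TRUE

# Base R way to keep only the rows where wages is not missing
toy_clean <- toy[!is.na(toy$wages), ]
nrow(toy_clean)          # 2

# Many functions return NA when given missing values unless told to skip them
mean(toy$wages)                  # NA
mean(toy$wages, na.rm = TRUE)    # 49500
```

The last two lines show why missing values need attention before computing averages: a single NA is enough to make `mean()` return NA.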

In [None]:
# check if there's any NA's in variable "sex"
any(is.na(sex))

In this dataset, it appears that every individual has been classified as either `man` or `woman`.

In [None]:
# check if there's any NA's in variable "wage"
any(is.na(wage))

Alas, we do have a few NA values for the variable `wages`. Let's get rid of them now!

In [None]:
# we redefine census_data so that it is cleaned up without any missing values in wages
census_data <- drop_na(census_data, wages)

Note that in `drop_na()` we didn't use `$` to access the wages variable. The function `drop_na(data, ...)` takes a dataset as its first argument, followed by one or more column names, and drops the rows that have missing values in those columns.
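
To make the role of the column argument concrete, here is a small sketch on a made-up data frame (`toy_data` and its values are hypothetical). It assumes the `tidyr` package, which provides `drop_na()`, is installed. Naming a column restricts the search for NAs to that column, while leaving it out drops every row with an NA in any column.

```r
library(tidyr)

# toy_data is hypothetical: one NA in wages and one NA in age
toy_data <- data.frame(
    wages = c(50000, NA, 62000),
    age = c(30, 41, NA)
)

# only the row with a missing value in wages is dropped
nrow(drop_na(toy_data, wages))  # 2 rows remain

# with no column named, any row containing an NA is dropped
nrow(drop_na(toy_data))  # 1 row remains
```

This is why the column name matters: dropping rows based on columns we are not using would throw away observations unnecessarily.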

Next, we would like to do some manipulation using the variable `sex`. Since we want to calculate the mean wage of men and women separately, let's use a new function called `group_by()`, which comes from the `dplyr` package rather than base R. This function takes in a specified dataset and groups it by one or more variables we choose. Once grouped, we can perform operations "by group" (i.e. by sex).

In [None]:
# we redefine census_data so that it is now grouped by sex
census_data <- group_by(census_data, sex)

Finally, we calculate the mean wage of each group. The `summarise()` function is typically used on grouped data created by `group_by()`. `summarise()` takes in a dataset and allows us to define a new column name `meanWage` to be the `mean()` of `wages`. For output, it creates a new data frame with one (or more) rows for each combination of grouping variables (men & women), along with columns indicating our groups of focus and our newly defined variable.

In [None]:
summarise(census_data, meanWage = mean(wages))

From this table, we can clearly see the difference in average wages between men and women!
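
To see what `group_by()` and `summarise()` do on a dataset small enough to check by hand, consider the sketch below. The data frame `toy_data` and its wages are invented for illustration. The last two lines also show why the missing values were dropped first: `mean()` returns `NA` if any value is missing, unless `na.rm = TRUE` is set.

```r
library(dplyr)

# toy_data is hypothetical, with made-up wages
toy_data <- data.frame(
    sex = c("man", "man", "woman", "woman"),
    wages = c(40000, 60000, 45000, 75000)
)

# one row per group, with the group mean stored in meanWage
toy_table <- summarise(group_by(toy_data, sex), meanWage = mean(wages))
toy_table

# mean() propagates missing values unless told to remove them
mean(c(40000, NA))                # NA
mean(c(40000, NA), na.rm = TRUE)  # 40000
```

Here the group means are easy to verify: (40000 + 60000) / 2 = 50000 for `man` and (45000 + 75000) / 2 = 60000 for `woman`.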

That process required many steps and many blocks of code. Luckily, there is a critical tool in R that will expedite this process: `%>%` is called the **forward pipe operator** in R. It allows us to forward the result of a function/expression into the next function/expression. We can think of the pipe as saying "and then", allowing us to pipe previous results into upcoming functions in a sequence.
It is especially helpful when creating complex code with many parentheses. Through piping, our code is more readable both to others and ourselves. Let's take a look at a piping example below.
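
Before turning to the census data, here is a minimal sketch on a made-up vector (`toy_wages` is hypothetical). It assumes `%>%` is available, for example by loading `dplyr`, and shows that `x %>% f(y)` is another way of writing `f(x, y)`.

```r
library(dplyr)  # makes the %>% operator available

# toy_wages is a made-up vector for illustration
toy_wages <- c(40000, 60000, 50000)

# nested form: read from the inside out
nested <- head(sort(toy_wages), 2)

# piped form: read top to bottom as "take toy_wages, and then ..."
piped <- toy_wages %>%
    sort() %>%
    head(2)

identical(nested, piped)  # TRUE
```

Both forms sort the vector and keep the two smallest values; the piped version simply lists the steps in the order they happen.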

We will pipe the dataset through the following steps:
* eliminate NA values from `wages`
* group the dataset by `sex`
* create a new data frame with mean wages for each sex group

In [None]:
# Our data is first processed and then passed into the next function (so on and so on...)
# The result of the entire sequence is stored in the tibble named 'table'

table <- census_data %>%
    drop_na(wages) %>%
    group_by(sex) %>%
    summarise(meanWage = mean(wages))

table

We see that the output here is the same as before! Now, you might be wondering: shouldn't it be fine not to use this piping approach so long as our code gets the job done? From a consequentialist perspective, yes, this is fine. However, without piping, our code will be harder to debug if a problem arises; this will be magnified when we have hundreds of lines of code with many variable repetitions or parentheses close together. Let's take a look at two examples which do not use the piping approach to illustrate this point.

In [None]:
# Approach 1 (no piping): Store each step in the process sequentially
table <- drop_na(census_data, wages)
table <- group_by(table, sex)
table <- summarise(table, meanWage = mean(wages))
table

Approach 1 gets the job done, but it is repetitive. The object name `table` is typed over and over, which makes it harder for us to follow what is actually changing in each line.

In [None]:
# Approach 2 (no piping): Consolidate all functions with many parentheses
table <- summarise(group_by(drop_na(census_data, wages), sex), meanWage = mean(wages))
table

Approach 2 also gets the job done, but it is even harder to read. The functions must be read from the inside out, so compared to Approach 1 it becomes much harder to see the order of steps in the analysis.

In both cases, following the piping approach we demonstrated first is preferable for two reasons:
1. The piping reduces repetitive typing
2. The piping is the most straightforward syntax for humans to read and understand

Piping automatically passes the output of one step as the first argument of the next step. As a result, the code reads in the order the steps happen and focuses on actions rather than objects. The computer can process our code regardless of its format, but nested or repetitive code is not intuitive for human readers. If we make a mistake, the debugging process will be painful for us; we are not robots.

### Part 3: Wrapping up & Exercises ###

With this introduction, you have learned the basics of coding in R. You now understand how variables, objects and functions work, how to load different formats of data into R, and how to operate on data using the basic techniques presented. This lesson will be indispensable to you as you continue on in the course and delve further into the world of econometrics and data analysis/visualization. For now, here are some exercises to test your understanding!

#### Exercise 1

In [None]:
x <- 2
y <- 3
z <- ((x + y) ^ x) * y

z

# store the value you think z will be in ''answer_1'' by completing this code

answer_1 <- ...  # enter your numerical answer here

test_1()

#### Exercise 2

In [None]:
my_function <- function(x, y) {
    z <- x + y
    2 * z
}

b <- my_function(2, 3)

answer_2 <- ... # your numerical answer for b here

test_2()

#### Exercise 3

Using the same census dataframe, use piping to calculate the `mean`, `max`, and `min` wages for each group in the variable `pkids`, then save these results in a dataframe called `tibble`. The mean, max and min variables must be labelled as `meanwage`, `maxwage`, and `minwage` in that order for full marks. 

In [None]:
tibble <- census_data %>%
            # your code here

answer_3 <- tibble # your answer here

test_3()