# ECON 490: Opening Datasets (4)

## Prerequisites
---
1. Understand the basics of R such as data types and structures.


## Learning Outcomes
---

After completing this notebook, you will be able to:

1. Load a variety of data types into R using various functions
2. View and reformat variables, specifically by factorizing
3. Properly address how to work with missing data
4. Select subsets of observations and variables for use

## 4.1 The Data Analysis Procedure
In this notebook, we will focus on loading, viewing, and cleaning up our dataset: these are **fundamental** skills that are necessary for essentially every data project. This process usually consists of three steps:

1. We load the data into R, meaning we take a file on our computer and tell R how to interpret it.
2. We inspect the data through a variety of methods to ensure it looks good and was properly loaded.
3. We clean up the data by removing missing observations and adjusting the way variables are interpreted.

In this module, we will cover each of these three steps in detail. Let's start by looking at the loading process.

## 4.2 Loading Data
Remember, before we can load the data we need to tell R what packages we will be using in the notebook. Without these packages, R will not have access to the appropriate functions needed to interpret our raw data. As explained previously, packages only need to be installed once; however, they need to be imported every time we open a notebook.

We have discussed packages previously: for data loading, the two most important ones are `tidyverse` and `haven`.
* `tidyverse` should already be somewhat familiar. It includes a wide range of useful functions for working with data in R.
* `haven` is a package containing functions for importing data saved by other statistical software, such as Stata, SPSS, and SAS.

Let's get started by loading them now.

In [1]:
# loading in our packages
library(tidyverse)
library(haven)

-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.5     v dplyr   1.0.7
v tidyr   1.1.4     v stringr 1.4.0
v readr   2.0.2     v forcats 0.5.1

-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()



Data can be created by different programs and stored in different styles - these are called **file types**. We can usually tell what kind of file type we are working with by looking at the extension. For example, a text file usually has an extension like `.txt`. The data we will be using in this course is commonly stored in Stata, Excel, text, or comma-separated values files. These have the following extensions:

* .dta for a Stata data file
* .xls or .xlsx for an Excel file
* .txt for a text file
* .csv for a comma-separated values file

To load any dataset, we need to use the appropriate function in order to specify to R in which format the data is stored. 

- To load a .csv file, we use the command `read_csv("file name")`
- To load a Stata data file, we use the command `read_dta("file name")`
- To load an Excel file, we use the command `read_excel("file name")`
- To load a text file, we use the command `read_table("file name", col_names = FALSE)`.
  - The `col_names` argument specifies whether the first row of the data file contains column names (the base R equivalent, `read.table`, calls this argument `header`).
  
> **Note:** if we are using an Excel file, we need to load in the package `readxl` alongside the `tidyverse` and `haven` packages above to read the file.
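As a minimal sketch of the loading functions listed above, the following writes a small, made-up data frame to temporary `.csv` and `.dta` files and reads each one back with the matching function. The `toy` data frame and the temporary file paths are hypothetical and exist only for illustration; they are not part of the course data.

```r
library(readr)  # read_csv() and write_csv(); part of the tidyverse
library(haven)  # read_dta() and write_dta()

# a small made-up dataset (hypothetical, for illustration only)
toy <- data.frame(workerid = c(1, 2, 3),
                  sex      = c("M", "F", "M"),
                  earnings = c(100.5, 200, 150))

# write the same data in two formats to temporary files
csv_path <- tempfile(fileext = ".csv")
dta_path <- tempfile(fileext = ".dta")
write_csv(toy, csv_path)
write_dta(toy, dta_path)

# each file type needs its matching reader
from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_dta <- read_dta(dta_path)

dim(from_csv)  # rows and columns read from the csv file
dim(from_dta)  # rows and columns read from the Stata file
```

Using the wrong reader for a file type usually produces an error or a badly parsed dataset, which is why checking the extension first is worthwhile.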

In this module, we will be working with a simulated dataset on worker information over many years, and their participation in a training program to boost their earnings. The data is available as "fake_data.csv" and "fake_data.dta" (the same data in different formats). Let's read in our data in csv format now.

In [5]:
# reading in the data
fake_data <- read_csv("../econ490-stata/fake_data.csv")  #change me!

Rows: 2861772 Columns: 9

-- Column specification ------------------------------------------------------------------------------------------------
Delimiter: ","
chr (1): sex
dbl (8): workerid, year, birth_year, age, start_year, region, treated, earnings


i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.



Rows: 2,861,772
Columns: 9
$ workerid   <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9,~
$ year       <dbl> 1999, 2001, 2001, 2002, 2003, 2005, 2010, 1997, 2001, 2009,~
$ sex        <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",~
$ birth_year <dbl> 1944, 1944, 1947, 1947, 1947, 1951, 1951, 1952, 1952, 1954,~
$ age        <dbl> 55, 57, 54, 55, 56, 54, 59, 45, 49, 55, 57, 41, 45, 46, 49,~
$ start_year <dbl> 1997, 1997, 2001, 2001, 2001, 2005, 2005, 1997, 1997, 1998,~
$ region     <dbl> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4, 2,~
$ treated    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ earnings   <dbl> 39975.010, 278378.100, 18682.600, 293336.400, 111797.300, 8~


## 4.3 Viewing Data
Now that we've loaded in our data, it's important to inspect the data. Let's cover a series of commands which help us to do this.

#### 4.3.1 `glimpse`

The first command we are going to use describes the basic characteristics of the variables in the loaded data set.

In [None]:
glimpse(fake_data)

Alternatively, we can use the `print` command, which displays the same information as the `glimpse` command but in horizontal form.

In [None]:
print(fake_data)

With many variables, this can be harder to read than the `glimpse` command. Thus, we typically prefer to use the `glimpse` command.

#### 4.3.2 `View` and `head`

In addition to using the `glimpse` command, in RStudio we can also open the data viewer and see the raw data we have imported as if it were an Excel file. To do so, we use the `View` function, which opens a new tab with an interactive representation of our data. We can also use the command `head`. By default, this prints the first six rows of our dataset. We can supply a numeric argument to the function to increase or decrease the number of rows we want to see.

In [None]:
head(fake_data)

There is even the function `tail`, which functions identically to `head` but works from the back of the dataset (outputs the final rows).

In [None]:
tail(fake_data)
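As a small sketch of how these functions take arguments, the example below uses a made-up data frame (`toy`, hypothetical) rather than the course data. `head` and `tail` accept the number of rows as the argument `n`, and `slice` from `dplyr` picks out rows by position.

```r
library(dplyr)

# a made-up dataset with ten rows (hypothetical, for illustration only)
toy <- data.frame(workerid = 1:10,
                  earnings = seq(100, 1000, by = 100))

first_three <- head(toy, n = 3)   # first three rows instead of the default six
last_two    <- tail(toy, n = 2)   # last two rows
rows_4_to_6 <- slice(toy, 4:6)    # rows 4, 5 and 6, chosen by position

first_three
last_two
rows_4_to_6
```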

Opening the data viewer has many benefits. Most importantly, we get to see our data as a whole, which gives us a clearer perspective on the information the dataset provides. For example, here we observe that we have unique worker codes, the year in which each worker is observed, worker characteristics, and whether or not they participated in the training program. This is particularly useful when we first load a dataset, since it lets us know whether our data has been loaded correctly and looks appropriate.

#### 4.3.3 `summary` and `sapply`

We can further analyze any variable by using the `summary` command. This command gives us the minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum of each of our numeric variables, as well as their means. It is a good command for getting a quick overview of the general spread of the variables in our dataset.

In [None]:
summary(fake_data)

We can also apply summary to specific variables.

In [66]:
summary(fake_data$earnings)



If we want to quickly access more specific information about our variables, such as their standard deviations, we can pass the relevant function, such as `sd`, as an argument to the function `sapply`. It will output the standard deviation of each of our numeric variables. However, it will not operate on character variables. Remember, we can check the type of each variable using the `glimpse` function from earlier.

In [None]:
sapply(fake_data, sd)

We can also pass functions such as `mean`, `min`, and `median` to `sapply` in the same way; however, `sd` is a good choice since it is not covered by the `summary` function.
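One way to avoid the character-variable problem is to restrict `sapply` to the numeric columns first. The sketch below assumes a made-up data frame (`toy`, hypothetical) and uses only base R; it is one possible approach, not the only one.

```r
# a made-up dataset mixing numeric and character columns (hypothetical)
toy <- data.frame(age      = c(30, 40, 50),
                  earnings = c(100, 200, 300),
                  sex      = c("M", "F", "M"))

# find which columns are numeric, then summarize only those
is_num <- sapply(toy, is.numeric)

sds   <- sapply(toy[is_num], sd)    # standard deviation of each numeric column
means <- sapply(toy[is_num], mean)  # the same pattern works for mean, min, median

sds
means
```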

#### 4.3.4 `count` and `table`

We can also learn more about the frequency of the different values of our variables by using the command `count`. We simply supply a specific variable to the function to see the distribution of values for that variable.

In [None]:
count(fake_data, region)

Here we can see that there are five regions in this dataset, with more of the people surveyed coming from region 1 and fewer coming from region 3. Similarly, we can use the `table` function and specify our variable to accomplish the same task.

In [None]:
table(fake_data$region)
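As a short sketch of two common variations, `count` can sort its output by group size, and wrapping `table` in `prop.table` gives shares instead of raw counts. The `toy` data frame below is made up for illustration.

```r
library(dplyr)

# a made-up dataset with three regions (hypothetical)
toy <- data.frame(region = c(1, 1, 1, 2, 2, 3))

counts <- count(toy, region, sort = TRUE)   # one row per region, largest group first
shares <- prop.table(table(toy$region))     # share of observations in each region

counts
shares
```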

## 4.4 Cleaning Data
Now that we've loaded in our data, the next step is to do some rudimentary cleaning of our data. This most commonly includes factorizing variables and dropping missing observations.

#### 4.4.1 Factorize variables

We have already seen that there are different types of variables which can be stored in R. Namely, there are quantitative variables and qualitative variables. Any qualitative variable can be stored in R as a set of strings or letters; these are known as **character** variables. Qualitative variables can also be stored in R as **factor** variables. A factor variable associates each qualitative response with a categorical level, making analysis much easier. Additionally, data is often encoded, meaning that the levels of a qualitative variable have been represented by "codes", usually in numeric form.

Look at the *region* variable in the output from `glimpse` above:

```
region     <dbl> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4, 2,~
```

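The *region* variable in this dataset corresponds to the particular region in which the worker lives, stored as a numeric code. Because we loaded the CSV file, its type is `<dbl>`. Had we loaded the Stata file with `read_dta` instead, an encoded variable like this could appear with type `<dbl+lbl>`: this is a labeled double. That would be good: it means that R already understands what the levels of this variable mean.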

There are three similar ways to change variables into factor variables. 

1.  We can change a specific variable inside a dataframe to a factor by using the `as_factor` command

In [10]:
fake_data <- fake_data %>%  # we start by saying we want to update the data, AND THEN... (%>%)
    mutate(region = as_factor(region)) # mutate (update) region to be a factor variable

glimpse(fake_data)

Rows: 2,861,772
Columns: 9
$ workerid   <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9,~
$ year       <dbl> 1999, 2001, 2001, 2002, 2003, 2005, 2010, 1997, 2001, 2009,~
$ sex        <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",~
$ birth_year <dbl> 1944, 1944, 1947, 1947, 1947, 1951, 1951, 1952, 1952, 1954,~
$ age        <dbl> 55, 57, 54, 55, 56, 54, 59, 45, 49, 55, 57, 41, 45, 46, 49,~
$ start_year <dbl> 1997, 1997, 2001, 2001, 2001, 2005, 2005, 1997, 1997, 1998,~
$ region     <fct> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4, 2,~
$ treated    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ earnings   <dbl> 39975.010, 278378.100, 18682.600, 293336.400, 111797.300, 8~


Do you see the difference in the _region_ variable? You can also see that the type has changed to `<fct>`, for **factor variable**.

If the type had been `<dbl+lbl>`, R would already know how to "decode" the factor levels from the imported data. But what if that isn't the case? This brings us to the next method.

2.  We can **supply a list of factors** using the `factor` command.   This command takes two other values:
    * A list of levels the qualitative variable will take on
    * A list of labels, one for each level, which describes what each level means
    
We can create a custom factor variable as follows:

In [17]:
# first, we write down a list of levels
region_levels <- c(1:5)
# then, we write down a list of our labels
region_labels <- c('Region A', 'Region B', 'Region C', 'Region D', 'Region E')

# now, we use the command but with some options - telling factor() how to interpret the levels

fake_data <- fake_data %>%  # we start by saying we want to update the data, AND THEN... (%>%)
    mutate(region2 = factor(region,   # notice it's factor, not as_factor
                          levels = region_levels, 
                          labels = region_labels)) # create region2, a labelled factor version of region
glimpse(fake_data)

Rows: 2,861,772
Columns: 10
$ workerid   <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9,~
$ year       <dbl> 1999, 2001, 2001, 2002, 2003, 2005, 2010, 1997, 2001, 2009,~
$ sex        <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",~
$ birth_year <dbl> 1944, 1944, 1947, 1947, 1947, 1951, 1951, 1952, 1952, 1954,~
$ age        <dbl> 55, 57, 54, 55, 56, 54, 59, 45, 49, 55, 57, 41, 45, 46, 49,~
$ start_year <dbl> 1997, 1997, 2001, 2001, 2001, 2005, 2005, 1997, 1997, 1998,~
$ region     <fct> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4, 2,~
$ treated    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ earnings   <dbl> 39975.010, 278378.100, 18682.600, 293336.400, 111797.300, 8~
$ region2    <fct> Region A, Region A, Region D, Region D, Reg

Again, do you see the difference between _region_ and _region2_ here?  This is how we can customize factor labels when creating new variables.

3.  The final method is very similar to the first; if we have a large dataset, it can be tiresome to decode all of the variables one-by-one.  Instead, we can use `as_factor` on the **entire dataset** and it will convert every variable that has an appropriate (labelled) type.

In [18]:
fake_data <- as_factor(fake_data)

glimpse(fake_data)

Rows: 2,861,772
Columns: 10
$ workerid   <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9,~
$ year       <dbl> 1999, 2001, 2001, 2002, 2003, 2005, 2010, 1997, 2001, 2009,~
$ sex        <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",~
$ birth_year <dbl> 1944, 1944, 1947, 1947, 1947, 1951, 1951, 1952, 1952, 1954,~
$ age        <dbl> 55, 57, 54, 55, 56, 54, 59, 45, 49, 55, 57, 41, 45, 46, 49,~
$ start_year <dbl> 1997, 1997, 2001, 2001, 2001, 2005, 2005, 1997, 1997, 1998,~
$ region     <fct> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4, 2,~
$ treated    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ earnings   <dbl> 39975.010, 278378.100, 18682.600, 293336.400, 111797.300, 8~
$ region2    <fct> Region A, Region A, Region D, Region D, Reg

This is our final dataset, with the relevant variables factorized.
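To see what the `levels` and `labels` arguments of `factor` do in isolation, here is a small sketch on a made-up vector of codes (hypothetical, not the course data). It also shows what happens to a code that is not listed among the levels.

```r
# made-up region codes; 6 is deliberately not one of the valid levels
codes <- c(1, 4, 5, 2, 6)

region_f <- factor(codes,
                   levels = 1:5,
                   labels = c("Region A", "Region B", "Region C", "Region D", "Region E"))

region_f              # each code is shown by its label; the unmatched 6 becomes NA
levels(region_f)      # the five labels, in level order
as.integer(region_f)  # the underlying integer position of each level
```

Because unmatched codes silently turn into `NA`, it can be worth checking for missing values right after factorizing.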

#### 4.4.2 Remove missing data

We often face the challenge of dealing with missing observations for some of our variables. To check whether any of our variables have missing values, we can use the `is.na` function alongside the `any` function. This will return a value of TRUE or FALSE depending on whether or not we have any missing observations.

In [None]:
any(is.na(fake_data))

Here, we can see that our dataset already has no missing observations, so we do not need to worry about the process of potentially removing or redefining them. However, this will often not be the case.

> **Important Note**: choosing which observations to drop is always an important research decision. There are two key ways to handle missing data: dropping it, or treating "missing" as its own valid category. These decisions have important consequences for your analysis, and should always be carefully thought through - especially if the reasons why data are missing might not be random.
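As a sketch of the two approaches described in the note, the example below uses a made-up data frame (`toy`, hypothetical) that contains real `NA` values. It assumes the `tidyr` functions `drop_na` and `replace_na`, which are loaded with the `tidyverse`; the label `"missing"` is an arbitrary choice for illustration.

```r
library(dplyr)
library(tidyr)

# a made-up dataset with some missing values (hypothetical)
toy <- data.frame(workerid = 1:4,
                  sex      = c("M", NA, "F", NA),
                  earnings = c(100, 200, NA, 400))

colSums(is.na(toy))   # number of missing values in each column

# option 1: drop the rows where sex is missing
dropped <- drop_na(toy, sex)

# option 2: keep every row and treat "missing" as its own category
recoded <- mutate(toy, sex = replace_na(sex, "missing"))

nrow(dropped)
table(recoded$sex)
```

Counting missing values per column first, as `colSums(is.na(toy))` does here, helps show how many observations each choice would affect.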

Let's go through the process of dropping missing observations for the *sex* variable anyway, assuming that missing observations are coded as "not available". We will do this as a demonstration, even though no observations will actually be dropped. To do this, we will use the `filter()` function. This function evaluates each row (observation) against the supplied condition and retains only the rows where the condition is true (selection by inclusion); all other rows are dropped. To use this to drop hypothetical missing observations for *sex*, we would do the following:

In [1]:
filter(fake_data, sex != "not available")



> **Recall**: The operator `!=` is a comparison operator meaning "not equal to". Therefore we are telling R to keep the observations that are not equal to "not available".

This process utilized the `filter` function, which retains rows meeting a specific condition. However, we can also supply a series of conditions to filter at once. We could have, for instance, decided that we only wanted to keep observations for females from region 1. In this case, we could run the following code.

In [None]:
head(filter(fake_data, sex == "F" & region == 1))
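As a brief sketch of how conditions combine inside `filter`, the example below uses a made-up data frame (`toy`, hypothetical). Conditions joined with `&` must all hold, conditions passed as separate arguments behave the same way, and `|` keeps rows where at least one condition holds.

```r
library(dplyr)

# a made-up dataset (hypothetical, for illustration only)
toy <- data.frame(sex      = c("F", "F", "M", "M"),
                  region   = c(1, 2, 1, 3),
                  earnings = c(100, 200, 300, 400))

both   <- filter(toy, sex == "F" & region == 1)   # both conditions must hold
commas <- filter(toy, sex == "F", region == 1)    # separate arguments behave like &
either <- filter(toy, sex == "F" | region == 1)   # at least one condition holds

both
either
```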

#### 4.4.3 Remove variables
Beyond filtering observations as was done above, we sometimes want to "filter" our variables. This process of operating on columns instead of rows requires the `select` function instead of the `filter` function. This is a useful function when we have more data at our disposal than we actually need to answer the research question at hand. This is especially pertinent given the propensity for datasets to collect an abundance of information, some of which may not be useful to us at a given point and instead slow down our loading and cleaning process.

Let's assume we are interested in seeing the gender wage gap among male and female workers of region 2, and nothing else. To help us with our analysis, we can filter by only observations which belong to region 2, then select for just the variables we are interested in.

In [None]:
head(fake_data %>% filter(region == 2) %>% select(sex, earnings)) 

As we can see above, we pass to the `select` function every column we wish to keep. Alternatively, we can prefix column names with `-` to drop those columns and keep everything else.

* `select(variables, I, want, to, keep)`
* `select(-variables, -I, -do, -not, -want)`

This is very useful and is usually done for practical reasons such as memory. Cleaning datasets to remove unessential information also allows us to focus our analysis and makes it easier to answer our desired research question. In our specific case, we want to keep data on just earnings and sex, which is what the `select` call above does.
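The two forms of `select` can be compared side by side in a short sketch. The `toy` data frame below is made up for illustration; both calls end up with the same two columns.

```r
library(dplyr)

# a made-up dataset (hypothetical, for illustration only)
toy <- data.frame(workerid = 1:3,
                  sex      = c("M", "F", "F"),
                  region   = c(2, 2, 1),
                  earnings = c(100, 200, 300))

kept    <- select(toy, sex, earnings)        # keep only the listed columns
dropped <- select(toy, -workerid, -region)   # drop the listed columns, keep the rest

names(kept)
names(dropped)
```

Which form is more convenient usually depends on whether the list of columns to keep or the list to drop is shorter.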

## 4.5 Wrap Up
In this notebook, we have covered the basic process of working with data. Specifically, we looked at how to load data, how to view it, and how to clean it by factorizing variables and dropping variables and observations. This general scheme is critical to any research project, so it is important to keep in mind as you progress through your undergraduate economics courses and beyond. In the next section, we will cover a larger concept which is also essential to cleaning a dataset, but merits its own section: creating variables.

## References
---
* [Introduction to Probability and Statistics Using R](https://mran.microsoft.com/snapshot/2018-09-28/web/packages/IPSUR/vignettes/IPSUR.pdf)
* [DSCI 100 Textbook](https://datasciencebook.ca/index.html)