# ECON 490: Generating Variables (5)

## Prerequisites 
---
1. Import datasets in csv and dta format 
2. Save files 

## Learning objectives:
---
In this module, you will learn to:
1. Explore a dataset with commands like `glimpse`, `View`, `head`, `summary`, `sapply`, `count` and `table`
2. Generate dummy (or indicator) variables using `ifelse`
3. Create new variables using `mutate`
4. Rename variables using `rename`

## 5.1 Getting Started
---
We'll continue working with the fake data dataset introduced in the previous lecture. Recall that this dataset simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

Last lecture we introduced the process of loading our `fake_data` dataset into R:

1. Import the relevant package (`haven`), which gives us access to commands for loading the data. Also import the `tidyverse` package in order to clean our data.
2. Use the `read_csv` or `read_dta` functions to load our dataset.
3. Clean our data by converting all important variables to factors.

Let's run through this procedure quickly so that we are all ready to do our analysis. 

In [1]:
library(haven)
library(tidyverse)

“package ‘haven’ was built under R version 4.1.3”
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

“package ‘ggplot2’ was built under R version 4.1.3”
“package ‘tibble’ was built under R version 4.1.3”
“package ‘dplyr’ was built under R version 4.1.3”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [None]:
data <- read_csv("../econ490-stata/fake_data.csv")
data <- as_factor(data)

## 5.2 Commands to Explore the Dataset
---

### 5.2.1 `glimpse`

The first command we are going to use describes the basic characteristics of the variables in the loaded data set.

In [None]:
glimpse(data)

Alternatively, we can use the `print` command, which displays the same information as the `glimpse` command but in horizontal form.

In [None]:
print(data)

With many variables, this can be harder to read than the `glimpse` command. Thus, we typically prefer to use the `glimpse` command.

### 5.2.2 `View` and `head`

In addition to using the `glimpse` command, in RStudio we can also open our data editor and see the raw data we have imported as if it were an Excel file. To do so, we can use the `View` function. This command opens a new tab with an interactive representation of our data. We can also use the command `head`. By default, this prints the first six rows of our dataset, much as they would appear in Excel. We can supply a numeric argument, as in `head(data, 10)`, to increase or decrease the number of rows we want to see. To see specific rows, we can index the dataset by position, as in `data[5:10, ]`.

In [None]:
head(data)

There is also the function `tail`, which works like `head` but starts from the end of the dataset (it outputs the final rows).

In [None]:
tail(data)
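As a small illustration of how `head` and `tail` respond to a row count, the sketch below uses a made-up data frame called `toy`. The name and its contents are hypothetical stand-ins, not the course dataset.

```r
# Hypothetical toy data frame standing in for the course dataset
toy <- data.frame(workerid = 1:20, year = rep(2001:2010, times = 2))

first_default <- head(toy)      # no row count given: first 6 rows
first_ten     <- head(toy, 10)  # first 10 rows
last_three    <- tail(toy, 3)   # last 3 rows
rows_5_to_8   <- toy[5:8, ]     # specific rows selected by position

nrow(first_default)
nrow(first_ten)
last_three$workerid
```

Indexing with `toy[5:8, ]` is plain base R subsetting, so it works on any data frame without loading extra packages.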

Opening the data editor has many benefits. Most importantly, we get to see our data as a whole, which gives us a clearer perspective on the information the dataset provides. For example, here we observe that we have unique worker codes, the year in which they are observed, worker characteristics (sex, age, and earnings), and whether or not they participated in the training program. This is particularly useful when we first load a dataset, since it lets us know if our data has been loaded in correctly and looks appropriate.

### 5.2.3 `summary` and `sapply`

We can further analyze any variable by using the `summary` command. This command gives us the minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum of each of our numeric variables, as well as their means. It is a good command for getting a quick overview of the general spread of all variables in our dataset.

In [None]:
summary(data)

We can also apply summary to specific variables.

In [66]:
summary(data$earnings)


If we want to quickly access more specific information about our variables, such as their standard deviations, we can pass a function such as `sd` to `sapply`, which applies it to every variable in the dataset. The call below outputs the standard deviation of each of our numeric variables. However, `sd` is only meaningful for numeric variables; it cannot produce a standard deviation for character variables. Remember, we can check the type of each variable using the `glimpse` function from earlier.

In [None]:
sapply(data, sd)

We can also apply arguments such as mean, min, and median to the function above; however, sd is a good one since it is not covered in the `summary` function.
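One way to avoid problems with non-numeric columns is to keep only the numeric columns before calling `sapply`. The sketch below does this on a made-up data frame `toy`; the column names and values are assumptions for illustration only.

```r
# Hypothetical toy data with two numeric columns and one character column
toy <- data.frame(
  earnings = c(10, 20, 30, 40),
  age      = c(25, 35, 45, 55),
  sex      = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric
is_num <- sapply(toy, is.numeric)

# Standard deviation of each numeric column only
sds <- sapply(toy[is_num], sd)
sds
```

The same pattern works with `mean`, `min`, or `median` in place of `sd`.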

### 5.2.4 `count` and `table`

We can also learn more about how often each value of a variable occurs by using the command `count`. We simply supply a specific variable to the function to see the distribution of values for that variable.

In [None]:
count(data, region)

Here we can see that there are five regions indicated in this dataset, that more of the people surveyed came from region 1, and that fewer came from region 3. Similarly, we can use the `table` function and specify our variable to accomplish the same task.

In [None]:
table(data$region)

We can also use `group_by` before our `count` command if we want more information about these regions. We can try this below with region and sex. We will see that there were 234,355 female identified persons surveyed in region 1 and 425,698 male identified persons surveyed in region 2. 

In [None]:
data %>% group_by(region) %>% count(sex)
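To make the shape of the grouped output concrete, here is a minimal sketch on a made-up tibble `toy` (hypothetical values, not the course data). It counts observations by sex within each region.

```r
library(dplyr)

# Hypothetical toy data: six observations across three regions
toy <- tibble(
  region = c(1, 1, 1, 2, 2, 3),
  sex    = c("F", "M", "F", "M", "M", "F")
)

# One row per region-sex combination that appears, with the count in column n
by_region_sex <- toy %>%
  group_by(region) %>%
  count(sex)

by_region_sex
```

Calling `count(toy, region, sex)` gives the same counts without the explicit `group_by` step.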

## 5.3 Generate Dummy Variables
---
Dummy variables are variables that can only take on two values: 0 and 1. It is useful to think of a dummy variable as being the answer to a question that can be answered "yes" or "no". With a dummy variable the answer yes is coded as "1" and no is coded as "0".

Examples of questions that are used to create dummy variables include:

1. Is the person female? Females are coded "1" and males are coded "0".
2. Does the person have a university degree? People with a university degree are coded "1" and everyone else is coded "0".
3. Is the person married? Married people are coded "1" and everyone else is coded "0".
4. Is the person a millennial? People born between 1980 and 1996 are coded "1" and those born in other years are coded "0".

As you have probably already figured out, dummy variables are used primarily for data that is qualitative and cannot be ranked in any way. For example, being married is qualitative and "married" is neither higher nor lower than "single".  However, dummy variables sometimes also refer to variables that are qualitative and ranked, such as level of education, and sometimes for variables that are quantitative, such as age groupings. 

It is important to remember that dummy variables must always be used when we want to include categorical (qualitative) variables in our analysis. These are variables such as sex, gender, race, marital status, religiosity, immigration status etc. Without creating dummy variables for these demographics, analysis of the results from data analysis, regression, and other research will not be meaningful, as we are working with variables which have been numerically scaled in an arbitrary way. This is especially true for interpreting the coefficients outputted from a regression.

### 5.3.1 Creating Dummy Variables using `ifelse`

Let's do an example where we create a dummy variable that indicates if the observation identified as female. We are going to use the command `ifelse` which generates a completely new variable based on certain conditions. 

In [None]:
data$female = ifelse(data$sex == "F", 1, 0)

What R interprets here is that if the condition `sex == "F"` holds, then our dummy takes the value 1; otherwise it takes the value 0. This is where the name `ifelse` comes from. Note that if `sex` is missing (`NA`), the condition is also `NA`, and so is the dummy. Depending on what you're doing, you may want the dummy to be zero when `sex` is missing. We can first check whether we have any missing observations for a given variable by nesting the `is.na` function within the `any` function. If there are any missing values for the `sex` variable in this dataset, the code below will return TRUE.

In [None]:
any(is.na(data$sex))

If we want to account for missing values and ensure that they are denoted as 0 for the dummy `female`, we can again invoke the `is.na` function as an additional condition in our function.

In [None]:
data$female = ifelse(data$sex == "F" & !is.na(data$sex), 1, 0)

The above condition within our function says that `female` == 1 only when `sex` == "F" and `sex` is not marked as NA (since !is.na must be TRUE).
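The difference between the two versions of the dummy is easiest to see on a short vector that contains a missing value. The sketch below uses a made-up vector `sex` purely for illustration.

```r
# Hypothetical vector with one missing value
sex <- c("F", "M", NA, "F")

# Without the is.na() check, a missing value stays missing
female_na <- ifelse(sex == "F", 1, 0)

# With the check, a missing value is coded as 0
female_zero <- ifelse(sex == "F" & !is.na(sex), 1, 0)

female_na    # 1 0 NA 1
female_zero  # 1 0 0 1
```

Which version is appropriate depends on the analysis: coding missing values as 0 treats those observations as not female, which may or may not be what we intend.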

### 5.3.2 Creating A Series of Dummy Variables using `ifelse`

We now know how to create dummy variables with `ifelse`. However, we may also want to create dummy variables corresponding to a whole set of categories for a given variable - for example, one for each region identified in the data set. To do this, we can just meticulously craft a dummy for each category, such as `reg1`, `reg2`, `reg3`, and `reg4`. We must leave out one region to serve as our base group, being region 5, in order to avoid the dummy variable trap.

In [None]:
data$reg1 = ifelse(data$region == 1 & !is.na(data$region), 1, 0)
data$reg2 = ifelse(data$region == 2 & !is.na(data$region), 1, 0)
data$reg3 = ifelse(data$region == 3 & !is.na(data$region), 1, 0)
data$reg4 = ifelse(data$region == 4 & !is.na(data$region), 1, 0)

These commands generated four new dummy variables, one for each region except the base group (region 5). We named them `reg1`, `reg2`, `reg3`, and `reg4`. This is quite cumbersome. There are packages which help to expedite this process. Luckily, if we are running a regression on a qualitative variable such as `region`, R will generate the necessary dummy variables for us automatically, provided the variable is stored as a factor.
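To see what this automatic expansion looks like, the sketch below applies base R's `model.matrix` to a made-up factor. This is an illustration under assumed data, not the course's regression code. By default R uses the first level as the base group; `relevel` changes that, here to region 5 to match the manual example.

```r
# Hypothetical region variable stored as a factor
region <- factor(c(1, 2, 3, 4, 5, 1, 3))

# Default: an intercept plus one dummy per level other than the first
dummies <- model.matrix(~ region)
colnames(dummies)

# Make region 5 the base group instead
reg <- relevel(region, ref = "5")
dummies5 <- model.matrix(~ reg)
colnames(dummies5)
```

With region 5 as the reference level, the generated columns are named `reg1` through `reg4`, mirroring the dummies built by hand.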

## 5.4 Generating Variables based on Expressions
---
Sometimes we want to generate variables after some transformations (e.g. squaring, taking logs, combining different variables). We can do that by simply writing the expression as an argument to the function `mutate`. For example, let's create a new variable that is simply the natural log of earnings:

In [None]:
data <- data %>% mutate(log_earnings = log(earnings))

summary(data$log_earnings)

Let's try a second example: we will create a new variable that is the number of years since the individual started working.

In [None]:
data <- data %>% mutate(experience_proxy = year - start_year)

summary(data$experience_proxy)

The `mutate` function allows us to easily add new variables to our dataframe. If we wanted to instead replace a given variable with a new feature, say add one default year to all experience_proxy observations, we can simply redefine it directly in our dataframe.

In [None]:
data$experience_proxy <- data$experience_proxy + 1
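Several new variables can also be created in a single `mutate` call. The sketch below uses a made-up tibble `toy` with assumed values; `experience_sq` is a hypothetical extra variable added only to show that later expressions can use variables created earlier in the same call.

```r
library(dplyr)

# Hypothetical toy data mirroring the variable names used above
toy <- tibble(
  earnings   = c(100, 1000, 10000),
  year       = c(2005, 2005, 2010),
  start_year = c(2000, 1995, 2010)
)

toy <- toy %>%
  mutate(
    log_earnings     = log(earnings),       # natural log
    experience_proxy = year - start_year,   # years since starting work
    experience_sq    = experience_proxy^2   # uses the variable created just above
  )

toy
```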

## 5.5 Following Good Naming Conventions
---
Choosing good names for your variables is more important, and harder, than you might think! Sometimes, the variables in a dataset have unrecognizable names, which may be confusing when conducting research. In these cases, it is a good idea to change them immediately. In your research, you will also be creating your own variables (like dummy variables) for qualitative measures and will want to be careful about giving them good names. This is especially important for generating tables, since you will want your tables to be easily legible in your paper.


We can rename variables with the `rename` function found inside the `dplyr` package (which we can access by loading the tidyverse). Let's try to rename one of the dummy variables we created above. Maybe we know that if region = 3 then the region is in the west.

In [None]:
# Assign the result back to data so that the new name is kept
data <- rename(data, west = reg3)

Don’t think that you need to include every piece of information in your variable names. Most of the important information pertaining to a variable is included in its label (more on that in a moment). Avoid variable names that include unnecessary pieces of information and can only be interpreted by you. 
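As a quick check of how `rename` behaves, the sketch below applies it to a made-up tibble `toy` (hypothetical data). The syntax is `new_name = old_name`, and the function returns a modified copy rather than changing the original in place.

```r
library(dplyr)

# Hypothetical toy data with one of the region dummies
toy <- tibble(workerid = 1:3, reg3 = c(0, 1, 0))

# rename() returns a modified copy, so assign it back to keep the change
toy <- toy %>% rename(west = reg3)

names(toy)
```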

## Wrapping Up
---
When we are doing our own research, we *always* have to spend some time working with the data before beginning the analysis. In this module we have learned some important tools for manipulating data to get it ready for that analysis. Like everything else that you do in R, emphasis should be on readability and reproducibility in your code. This is pivotal for you and your audience to understand your research.