# Introduction to Data

* **Authors**: COMET Team (Valeria Zolla, Colby Chamber, Colin Grimes, Jonathan Graves)
* **Last Update**: 10 August 2022
___

## Outline 
### Prerequisites

* Introduction to Jupyter
* Introduction to R 

### Outcomes

After completing this notebook, you will be able to:

* Identify and understand the packages and commands needed to load, manipulate, and combine data frames in R
* Load data using R in a variety of forms
* Create and reformat data, including transforming data into factor variables
* Handle missing data
* Select subsets of the data set for use
* Combine and append data

### References

* [Introduction to Probability and Statistics Using R](https://mran.microsoft.com/snapshot/2018-09-28/web/packages/IPSUR/vignettes/IPSUR.pdf)
* [DSCI 100 Textbook](https://datasciencebook.ca/index.html)

In [None]:
# Run this cell

source("intro_to_data_tests.r")

We will start this notebook by focusing on loading and cleaning up our dataset: these are **fundamental** skills which will be necessary for essentially every data project we will do.  This process usually consists of three steps:

1.  We load the data into R, meaning we take a file on our computer (or the web) and tell R how to interpret it.
2.  We clean up the data by removing missing values or adjusting the way variables are interpreted.
3.  We select and filter the data into an appropriate dataset for our project, which can involve creating or deleting variables or observations.

In this notebook, we will cover each of these three steps in detail. Let's start by looking at the loading process.

## Loading Data

Remember, before we can load the data we need to tell R what **packages** we will be using in the notebook. Without these packages, R will not have access to the appropriate functions needed to interpret our raw data. As explained in the previous lesson, packages only need to be installed once; however, they need to be imported every time we open a notebook.

We have discussed packages previously: for data loading, the two most important ones are `tidyverse` and `haven`.
* `tidyverse` should already be somewhat familiar. It includes a wide range of useful functions for working with data in R.
* `haven` is a special package containing functions that can be used to import data.

Let's get started by loading them now.

In [None]:
# loading in our packages
library(tidyverse)
library(haven)

Data can be created by different programs and stored in different styles - these are called **file types**. We can usually tell what kind of file type we are working with by looking at the extension.  For example, a text file usually has an extension like `.txt`.  The data we will be using in this course is commonly stored in STATA, Excel, text, or comma-separated values files.  These have the following types:

* `.dta` for a STATA data file
* `.xls` or `.xlsx` for an Excel file
* `.txt` for a text file
* `.csv` for a comma-separated values file

To load any dataset, we need to use the appropriate function so that R can read the dataset correctly. 

- To load a `.csv` file we use the command `read_csv("file name")`
- To load a STATA file we use the command `read_dta("file name")`
- To load an Excel file we use the command `read_excel("file name")`
- To load a text file we use the command `read_table("file name", col_names = FALSE)`.
  - The `col_names` argument specifies whether or not the first row of our data file contains column names. 
  
> **Note:** if we are using an Excel file, we need to load in the package `readxl` alongside the `tidyverse` and `haven` packages above to read the file.
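
As a minimal sketch of matching a reader to a file type, the cell below writes a made-up toy data frame to a temporary `.csv` file and reads it back. The object names (`toy`, `csv_path`, `toy_csv`) and the values are assumptions for illustration only; they are not part of the census data.

```r
library(readr)

# Toy data written to a temporary file (names and values are made up)
toy <- data.frame(id = c(1, 2, 3), wages = c(15000, 30000, 42000))
csv_path <- tempfile(fileext = ".csv")
write_csv(toy, csv_path)

# The extension tells us which reader to reach for
tools::file_ext(csv_path)

# A .csv file is read with read_csv()
toy_csv <- read_csv(csv_path, show_col_types = FALSE)
toy_csv
```

The same pattern applies to the other file types: check the extension first, then pick the matching function from the list above.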

In this notebook, we will be working with data from the Canadian census which is stored as `01_census2016.dta`.  

#### Exercise: Load in the Dataset

Which function should we use to load this file? Write the name of the function just before the brackets (e.g. `read_table`)

In [None]:
# which function should we use?

answer0 <- "..."

test_0()

Did you get it?  Okay, now replace the `??` in the code below with that function to load the data!

In [None]:
# reading in the data
census_data <- ??(".../datasets/01_census2016.dta")  # change me!

# inspecting the data
glimpse(census_data)

## Cleaning Data
Now that we've loaded our data, the next step is to do some rudimentary cleaning. This can include redefining and factorizing variables, defining new variables, and dropping missing observations.

### Factor Variables

We have already seen that there are different types of variables which can be stored in R. Namely, there are quantitative variables and qualitative variables. Any qualitative variable can be stored in R as a set of strings or letters. These are known as **character** variables. Qualitative variables can also be stored in R as **factor** variables. A factor variable associates each qualitative response with a categorical value (a level), making analysis much easier. Additionally, data is often **encoded**, which means that the levels of a qualitative variable have been represented by "codes", usually in numeric form.

Look at line `pr` in the output from `glimpse` above:

```
pr      <dbl+lbl> 35, 35, 11, 24, 35, 35, 35, 10, 35, 35, 59, 59, 46, 24, 59
```

The `pr` variable in the Census data stands for province.  Do these look much like Canadian provinces to you? This is an example of encoding.  We can also see the variable type is `<dbl+lbl>`: this is a _labeled double_. This is good: it means that R already understands what the levels of this variable mean.

There are three similar ways to change variables into factor variables. 

1.  We can change a specific variable inside a dataframe to a factor by using the `as_factor` command

In [None]:
census_data <- census_data %>%  #we start by saying we want to update the data, AND THEN... (%>%)
    mutate(pr = as_factor(pr)) #mutate (update) pr to be a factor variable

glimpse(census_data)

Do you see the difference in the `pr` variable?  You can also see that the type has changed to `<fct>` for **factor variable**.
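
To see the same conversion on a small scale, here is a sketch using a hand-made labelled vector. The vector `pr_toy` and its label text are assumptions chosen for illustration; the actual labels stored in the census file may be spelled differently.

```r
library(haven)

# A toy labelled vector; the codes and label text here are illustrative
pr_toy <- labelled(
  c(35, 35, 11, 24),
  labels = c("prince edward island" = 11, "quebec" = 24, "ontario" = 35)
)

# The numeric codes are stored alongside a lookup table of labels
attr(pr_toy, "labels")

# as_factor() swaps each code for its label
pr_factor <- as_factor(pr_toy)
as.character(pr_factor)
```

The lookup table attached to the vector is what lets `as_factor` decode the values without any extra information from us.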

The reason R was able to do this automatically is because the type was `<dbl+lbl>`; because of the information contained in the original dataset, R already knew how to "decode" the factor variables from the imported data.  But what if that wasn't the case?  This brings us to the next method.

2.  We can **supply a list of factors** using the `factor` command.   This command takes two other values:
    * A list of levels the qualitative variable will take on
    * A list of labels, one for each level, which describes what each level means
    
Let's look at the `pkids` (has children) variable as an example. Let's suppose we didn't notice that it is of type `<dbl+lbl>` _or_ we decided we didn't like the built-in labels.  We can create a custom factor variable as follows:

In [None]:
# first, we write down a list of levels
kids_levels = c(0,1,9)

# then, we write a list of our labels
kids_labels = c('none', 'one or more', 'not applicable')

# finally, we use the command but with some options - telling factor() how to interpret the levels

census_data <- census_data %>%  # we start by saying we want to update the data, AND THEN... (%>%)
    mutate(pkids = factor(pkids,   # notice the function is "factor", not "as_factor"
                          levels = kids_levels, 
                          labels = kids_labels)) # mutate (update pkids) to be a factor of pkids
glimpse(census_data)

Again, do you see the difference here?  This is how we can customize factor labels when creating new variables.
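
One detail worth checking when supplying levels by hand is what happens to codes we forget to list. The sketch below uses a made-up vector (`pkids_codes`) that deliberately includes a code, 5, that is not among the declared levels.

```r
# Toy codes; 5 is deliberately not one of the declared levels
pkids_codes <- c(0, 1, 9, 1, 5)

pkids_factor <- factor(pkids_codes,
                       levels = c(0, 1, 9),
                       labels = c("none", "one or more", "not applicable"))

pkids_factor                          # the undeclared code 5 shows up as <NA>
levels(pkids_factor)                  # levels follow the order we supplied
table(pkids_factor, useNA = "ifany")  # counts per level, including missing
```

Because undeclared codes silently become missing values, it is a good habit to tabulate the result after creating a custom factor.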

3.  The final method is very similar to the first; if we have a large dataset, it can be tiresome to decode all of the variables one-by-one.  Instead, we can use `as_factor` on the **entire dataset** and it will convert all of the variables with appropriate types.

In [None]:
census_data <- as_factor(census_data)
glimpse(census_data)

This is our final dataset, all cleaned up! Notice that some of the variables (e.g. `ppsort`) were _not_ converted into factor variables.  

> **Test Your Knowledge**: Can you tell why?

### Creating New Variables

Another important clean-up task is to make new variables.  It's pretty uncommon for a data set to already contain all of the variables we want!

To do this, we can use the `mutate` command again: in fact, when we were making factor variables earlier, we were, in a sense, already making new variables!  However, let's see how we can do something more complex.

The `mutate` command is an efficient way of manipulating the columns of our data frame.  We supply it with a formula for creating the new variable:

```
census_data <- census_data %>%
        mutate(new_variable_name = function(stuff...))
```

Let's see it in action with the `log()` function which we will use to create a new variable for the natural logarithm of wages. This can be quite useful for many models.

In [None]:
census_data <- census_data %>%
        mutate(log_wages = log(wages)) # the log function

glimpse(census_data)

Do you see our new variable at the bottom?  Nice!
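
A small caution when taking logarithms: the sketch below, using a made-up data frame `toy_wages`, shows how `log()` treats zero and missing wages. The values are illustrative assumptions, not taken from the census file.

```r
library(dplyr)

# Toy wages, including a zero and a missing value
toy_wages <- data.frame(wages = c(0, 15000, 30000, NA))

toy_wages <- toy_wages %>%
  mutate(log_wages = log(wages))

toy_wages

# log(0) is -Inf and log(NA) is NA, so it is worth counting these
# before using the new variable in a model
sum(is.infinite(toy_wages$log_wages))
sum(is.na(toy_wages$log_wages))
```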

#### The `case_when` function

We can also use more complex functions.  We won't cover all of them in this notebook, but we will mention one very important example: the `case_when` function.  This function creates different values for an input based on specified cases. You can read more about it by running the code block below.

In [None]:
?case_when

Essentially, this function consists of a series of lines, and each line gives (i) a case and (ii) a value for the case.
> **Tip**: you can create a "default" case by using TRUE as the case in the last line.

Suppose we are looking at our `pkids` variable and find it frustrating that it has three levels (`'none', 'one or more', 'not applicable'`).  We are interested in creating a dummy variable which is equal to one if the respondent has children and zero otherwise. Let's call it `has_kids`. We can use the `case_when()` function to achieve this!

In [None]:
census_data <- census_data %>% 
    mutate(has_kids = case_when( # make a new variable, call it has_kids
        pkids == "none" ~ 0, # case 1: pkids is "none"; output is 0 (no kids)
        pkids == "one or more" ~ 1, # case 2: "one or more"; output is 1 (kids)
        pkids == 'not applicable' ~ 0)) # case 3: "not applicable"; output is 0 (no kids)

glimpse(census_data)

Notice that `has_kids` is not a factor variable.  We have to add on the appropriate code to do that.

#### Exercise: Factorize `has_kids`

Create an object, stored in `answer1`, in which the `census_data` data frame is identical to the one above but in which the `has_kids` variable is also in factor form.

In [None]:
answer1 <- census_data %>% 
                ... # fill me in!

test_1()

#### More Complex Variables
We can also create dummy variables from more complex variables. For example, we can create a dummy variable called `retired`. If the person is of retirement age then it will equal one. If they are younger than 65 (i.e., not of retirement age) then it will equal zero. To create this new variable we need to combine many categories of the `agegrp` variable into one. 

In [None]:
census_data <- census_data %>% 
    mutate(retired = case_when(
        agegrp %in% c("65 to 69 years", "70 to 74 years", "75 to 79 years",
                      "80 to 84 years", "85 years and over") ~ 1, # retirement age
        TRUE ~ 0)) %>% # otherwise
    mutate(retired = as_factor(retired)) # factor

glimpse(census_data)

We typically want to create a dummy variable to identify which individuals have a certain characteristic and which ones don't (using a binary configuration). This type of variable is useful in many models which have trouble handling more complex types of data. You now know how to create these variables!
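
One thing to keep in mind with a `TRUE` default is that it also catches missing values. The sketch below, on a made-up data frame `toy_ages`, shows one possible way to keep missing ages missing by listing that case first; cases are checked in order and the first match wins. The helper vector `retirement_groups` is an assumption introduced for readability.

```r
library(dplyr)

# Toy age groups, including one missing value
toy_ages <- data.frame(
  agegrp = c("20 to 24 years", "65 to 69 years", "85 years and over", NA)
)

retirement_groups <- c("65 to 69 years", "70 to 74 years", "75 to 79 years",
                       "80 to 84 years", "85 years and over")

toy_ages <- toy_ages %>%
  mutate(retired = case_when(
    is.na(agegrp) ~ NA_real_,           # checked first: keep missing ages missing
    agegrp %in% retirement_groups ~ 1,  # any retirement-age group
    TRUE ~ 0                            # default for everyone else
  ))

toy_ages
```

Without the first case, the row with a missing age group would fall through to the default and be coded as 0, which may or may not be what the analysis calls for.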

#### Exercise: Adding a Variable
Create an object `answer2` in which the `census_data` dataframe now has an extra dummy variable called `knows_english` which is equal to 1 if the respondent knows English and 0 if not. Make sure that this variable is factorized.

In [None]:
answer2 <- ... # fill me in!

test_2()

### Removing Missing Data

Removing missing data is an important step of our cleaning process. For example, if we wanted to create a dummy called `visible_minority_status` (1 if the person identifies as a visible minority), we would get our values from the variable `vismin`. However, if we check the different levels and labels of that variable we see that visible minority status is not available for all individuals. How do we handle this situation?

<span style="color:#CC7A00" > 🔎 **Let's think critically** </span>
> 🟠 Choosing which observations to drop is always an important research decision. There are different ways to handle missing data beyond just dropping it - for instance, by treating "missing" as its own valid category. These decisions have important consequences for your analysis, and should always be carefully thought through - especially if the reasons why data are missing might not be random.
>
> 🟠 In the context of census data, why might responses to the question of a person’s visible minority status be missing?\
> 🟠 Why might a state apparatus like Statistics Canada be interested in collecting data in this dimension?\
> 🟠 What might be some consequences for an analysis if these observations were dropped?

Let's start with simply removing the observations where data is not available. To do this, we will use the `filter()` function, which conditionally drops observations. Each row (observation) is evaluated against the supplied condition, and only observations where the condition is true are retained (selection by inclusion) in the dataframe. The `filter()` function checks all rows at once, not just one at a time. Let's see it in action below.

In [None]:
glimpse(filter(census_data, vismin != "not available"))

> **Recall**: The operator `!=` is a conditional statement for "not equal to". Therefore we are telling R to keep the observations that are NOT equal to "not available".

Notice the change in the number of observations here.  Now that the "not available" observations have been removed, we can create the dummy!

In [None]:
census_data <- filter(census_data, vismin != "not available")

census_data <- census_data %>% 
    mutate(minority = case_when(vismin == "not a visible minority" ~ 0, 
                                TRUE ~ 1)) %>%
    mutate(minority = as_factor(minority))
           
glimpse(census_data)

Now if we wanted to check whether there is a difference between the wages of those who are and are not visible minorities, we could create a table, make a chart or run some calculations, all concepts we will learn about in other notebooks. 
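
Since dropping observations is a research decision, it can help to count how many rows a filter removes before overwriting the data. The sketch below does this on a made-up data frame `toy_vismin`; the names and values are illustrative assumptions.

```r
library(dplyr)

# Toy data with two rows where visible minority status is not available
toy_vismin <- data.frame(
  ppsort = 1:5,
  vismin = c("chinese", "not available", "not a visible minority",
             "black", "not available")
)

# Keep only the rows where the condition is TRUE
kept <- toy_vismin %>%
  filter(vismin != "not available")

nrow(toy_vismin)               # rows before filtering
nrow(kept)                     # rows after filtering
nrow(toy_vismin) - nrow(kept)  # rows dropped
```

Storing the filtered result under a new name, as done here with `kept`, leaves the original data frame untouched for comparison.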

#### Exercise: Removing Missing Data

Create an object `answer3` which uses the `filter()` function to drop values of "NA" from `pkids` in our census dataframe.

In [None]:
answer3 <- census_data %>% ...

test_3()

### Changing Variable Names 

One optional step in our data cleaning process is changing the names of variables so they are appropriate and understandable when producing research. More specifically, this means making all of our variable names short and descriptive. Many datasets from statistical agencies already have excellent variable names, but renaming is a conventional and helpful step when we encounter a data set with non-descriptive names. For instance, many older datasets have variables with names like `EDMM1928`, which are not very helpful!

We can use the command `rename(new_name = old_name)` to change a variable's name. As you may have seen before, we can also use `mutate()` to create a new variable category.

<span style="color:#CC7A00" > 🔎 **Let's think critically** </span>
> 🟠 Sometimes variable name changes and category creations happen for reasons other than making a label clearer as we’ve seen above. For example, the label “illegal alien” was used to describe a genre of immigration status by the US Immigration and Customs Enforcement and the US Customs and Border Protection from 1980 to 2021, until pressures from [student activism](https://visaandgreencard.com/blog/library-congress-will-stop-using-phrase-illegal-alien/) and the [Biden administration](https://www.washingtonpost.com/immigration/illegal-alien-assimilation/2021/04/19/9a2f878e-9ebc-11eb-b7a8-014b14aeb9e4_story.html) resulted in its phasing out under the argument that this categorical label enabled racist attitudes to status-seeking folks in the United States. Today, many choose to use terms like “undocumented immigrant”, “non-citizen” and “migrant” instead.
>
> 🟠 How might the language of variable names shape how people, groups or concepts are represented?\
> 🟠 How can individuals (you!) and institutions ensure that label choices honour the consent and agency of the people and groups that they affect? Are there any tradeoffs associated with this pursuit?

In [None]:
head(census_data) %>% 
    rename(highest_degree = hdgree)
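
Note that `rename()` returns a new data frame rather than changing the original in place, so the result has to be assigned if we want to keep it. A minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration):

```r
library(dplyr)

# Toy data frame; the degree labels are illustrative
toy <- data.frame(ppsort = c(1, 2),
                  hdgree = c("bachelor's degree", "no certificate"))

# rename() returns a new data frame; `toy` is unchanged unless we assign
renamed <- toy %>%
  rename(highest_degree = hdgree)

names(toy)      # still the original names
names(renamed)  # the new name appears here
```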

## Adding and Removing Data
Another step in data cleaning involves bringing in new data from other datasets or dropping existing data from our data set entirely. In this step, we must always think very carefully about the purpose of our data in terms of the research we are doing. The underlying goal of our project should always guide the decision to bring in new datasets or remove information from our existing one.

### Combining Datasets

Often we will need to draw on data from multiple datasets. Most of the time, these datasets will be available for download in different files (each for a given year, month, country, etc.). For example, if we are working with macroeconomic data from Statistics Canada such as inflation rates, GDP, and population, data for these variables may be stored in different datasets corresponding to different years. Thus, if we want to compile them we need to combine them into the same data frame.

There are two key ways of combining data, each reflecting different goals:

1.  When we want to add more observations from another dataset into our existing dataset, we call this **appending** data.
    * If you think of a dataset as a spreadsheet, this is like taking one dataset and "pasting" it into the bottom of a current dataset to add more observations. We do this when the observations from the other dataset have recorded data for all existing variables.
2.  When we want to add new variables and their data from another dataset into our existing dataset, we call this **merging** data.
    * This is like looking up values in a table and then adding a column; in Excel, this is called a `VLOOKUP`. Importantly, we can only merge data that share a common column or **key** to  identify observations with particular values. For example, if we want to merge in data from a different year but for the same people (observations) as those we are currently working with, datasets will usually have an identifying number for the person that functions as our key. In this dataset, every individual is identified by their `ppsort` number, so this is the key we would use to merge in new variables. 

#### Appending Datasets
Let's say that our `census_data` data set is inexplicably missing 2 observations, the people coded as 867543 and 923845. These people have recorded observations for all of the same variables as those in our existing data frame, so we can simply append them to our `census_data` data frame. Let's add these two people to our data set for analysis by appending `extra_census_data` to `census_data`.

In [None]:
extra_census_data <- data.frame(ppsort = c(867543, 923845), agegrp = c("20 to 24 years", "35 to 39 years"), 
                                ageimm = c("5 to 9 years", "0 to 4 years"), cip2011 = c("01 education", "01 education"),
                                fol = c("english only", "french only"), hdgree = c("bachelor's degree", "bachelor's degree"),
                                immstat = c("non-immigrants", "immigrants"), kol = c("english only", "french only"),
                                lfact = c("employed - worked in reference week", "employed - worked in reference week"),
                                locstud = c("ontario", "quebec"), mrkinc = c(20000, 34000), pkids = c("none", "one or more"),
                                pr = c("ontario", "quebec"), sex = c("male", "female"), vismin = c("chinese", "black"),
                                wages = c(15000, 30000), log_wages = c(9.615805, 10.308952), has_kids = c(0, 1),
                                retired = c(0, 0), minority = c(1, 1))

census_data <- rbind(census_data, extra_census_data)

glimpse(census_data)

Here we used the function `rbind()` to append these datasets. We did this because we were appending rows to the bottom of the data frame. Our `census_data` data set now has 2 more rows than before, meaning we have successfully appended these two observations to our data frame for analysis.
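
The requirement that the new observations have data for all existing variables matters in practice. The sketch below, on made-up data frames, shows `rbind()` succeeding when the columns match and stopping with an error when they do not. As one possible alternative, `bind_rows()` from `dplyr` fills the missing column with `NA` instead. All object names here are assumptions for illustration.

```r
library(dplyr)

main          <- data.frame(ppsort = c(1, 2), wages = c(15000, 30000))
extra_same    <- data.frame(ppsort = 3, wages = 42000)
extra_missing <- data.frame(ppsort = 4)  # no wages column

# Same columns: rbind() appends the row
appended <- rbind(main, extra_same)
nrow(appended)

# Different columns: rbind() stops with an error (captured here with try())
rbind_result <- try(rbind(main, extra_missing), silent = TRUE)
inherits(rbind_result, "try-error")

# bind_rows() appends anyway and fills the missing column with NA
filled <- bind_rows(main, extra_missing)
filled
```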

#### Merging Datasets
Suppose now that we are conducting research and realize that we would like to include religion in our analysis. However, we don't currently have a variable pertaining to it in our `census_data` data set. Luckily, we find the data set `more_census_data` which includes a dummy variable for whether the respondent `ppsort` is religious. This variable is coded as `religious` and equals 1 if the person self-identifies as religious and 0 if not. 

In [None]:
set.seed(123) # makes the random draw below reproducible; you can ignore this line!

more_census_data <- data.frame(ppsort = census_data$ppsort, religious = sample(0:1, nrow(census_data), replace = TRUE))

glimpse(more_census_data)

We can see from glimpsing the above data set that it contains the same number of observations as our `census_data` data set, one observation for each `ppsort` ID. For each `ppsort` ID, there is a corresponding dummy of 0 or 1 indicating whether that person is religious. Since this is a new variable we want to add to our existing data set, we can merge this `more_census_data` into our existing data frame to match up each `religious` entry to its corresponding `ppsort` ID within the larger `census_data` data frame. We can do this with the function `merge()`. This function has the following default form:

`Merged_Data <- merge(data_frame_A, data_frame_B, by = "the variable that is common in both data frames")`

If the column on which we want to merge our datasets is named differently in the two data frames, we replace `by` with the options `by.x = ""` (the column name in the first data frame) and `by.y = ""` (the column name in the second).
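
As a sketch of the differently named case, base R's `merge()` accepts `by.x` and `by.y` for the key column in the first and second data frame respectively. The data frames `people` and `religion`, and the column name `person_id`, are made up for illustration.

```r
# Toy data frames whose key columns have different names
people   <- data.frame(ppsort = c(1, 2, 3), wages = c(15000, 30000, 42000))
religion <- data.frame(person_id = c(3, 1, 2), religious = c(1, 0, 1))

# by.x names the key in the first data frame, by.y the key in the second
merged <- merge(people, religion, by.x = "ppsort", by.y = "person_id")
merged
```

The merged data frame keeps the key under the first data frame's name (`ppsort`), and rows are matched on the key value rather than on their position.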

In [None]:
census_data <- merge(census_data, more_census_data, by = "ppsort")
glimpse(census_data)

We can now see that our `census_data` data set has an additional variable, `religious`, with data for each person. This was an example of a simple 1:1 (one-to-one) merge since there was a perfect match between the number of observations in both datasets. In this way, there were no extra observations in either. However, this will not often be the case. 

Suppose we only had half of the `ppsort` IDs in our `more_census_data` data frame. When merging, we could then have chosen to keep every observation in `census_data`, with `religious` recorded as `NA` for any `ppsort` that has no match in `more_census_data`. We could do this with the `left_join()` or `right_join()` functions, using the former if `census_data` is the first data set specified within the brackets of the function and the latter if it is the second. If we instead had some observations in each data set which did not have a match (the `ppsort` ID did not match) and we didn't want to lose these observations, we could use the `full_join()` function to retain all observations from both datasets in our final data frame. 

Finally, if we wanted only those observations which had matching `ppsort` IDs across both datasets, we could use the `inner_join()` function, returning us a smaller dataset. This is implicitly what we did in our merge above; however, since all observations in both datasets matched on `ppsort`, we didn't lose anything. For a helpful clarification of these commands, check out the help menu for these specific merges below!

In [None]:
?inner_join
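
To compare the join types side by side, here is a sketch on two made-up data frames that only partly overlap on `ppsort`. The names `left` and `right` and their values are assumptions for illustration.

```r
library(dplyr)

# IDs 2 and 3 appear in both; ID 1 only in `left`, ID 4 only in `right`
left  <- data.frame(ppsort = c(1, 2, 3), wages = c(15000, 30000, 42000))
right <- data.frame(ppsort = c(2, 3, 4), religious = c(0, 1, 1))

nrow(inner_join(left, right, by = "ppsort"))  # only IDs found in both
nrow(left_join(left, right, by = "ppsort"))   # every row of `left`
nrow(right_join(left, right, by = "ppsort"))  # every row of `right`
nrow(full_join(left, right, by = "ppsort"))   # every ID from either

# Unmatched rows are kept with NA in the columns from the other data frame
left_join(left, right, by = "ppsort")
```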

#### Exercise: Combining Dataframes

Create an object `A` which is a data frame with two columns and two rows. The first column should be called `ID` and have values 1 and 2, while the second column should be called `value1` and have values 10 and 11. 

Then create a second object named `B` which is a dataframe with two columns and two rows. The first column should be called ID and perfectly match the IDs in `A`, while the second column should be called `value2` and have values 12 and 13. Finally, merge these two data frames along the ID column and store in the object `answer4`.

In [None]:
# your code here

answer4 <- ... # fill me in!

test_4()

### Selecting and Filtering Data

Oftentimes we have more data at our disposal than we actually need to answer the question at hand. Because the process of data collection is so resource-intensive, surveyors and researchers attempt to collect as much data as possible from each individual that is surveyed. This massive amount of data allows many research questions to be explored, even ones that weren't intended by the surveyors to begin with. The Canadian census is a great example of this. 

Let's assume we are interested in determining the gender pay gap for residents in British Columbia. Our current 2016 census dataset encompasses a whole host of data from a large sample of residents across Canada and contains variables such as their age group, education level, income, minority status, etc. To help us with our analysis, we need to filter the census data to keep only residents of British Columbia. As mentioned above, the `filter()` function is used to conditionally drop rows. In our case, we can use it to go through each observation in the data frame and check whether the province `pr` is coded as "british columbia", then drop all observations for which this is not the case.

> **Note**: we've seen this before, but to check equality in R and most programming languages, you need to use `==` as opposed to `=`.

In [None]:
census_data <- census_data %>%
    filter(pr == "british columbia")

glimpse(census_data)

Sometimes we want to drop variables (columns) instead of observations (rows). The `select` function in R allows us to do this. We can either pass every column we wish to keep, or put a minus sign in front of every column we wish to drop:

* `select(variables, I, want, to, keep)`
* `select(-variables, -I, -do_not, -want)`
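
The two forms can be sketched on a made-up data frame as follows; `toy` and its values are assumptions for illustration.

```r
library(dplyr)

toy <- data.frame(ppsort = c(1, 2),
                  sex    = c("male", "female"),
                  wages  = c(15000, 30000),
                  pr     = c("ontario", "british columbia"))

# Keep only the listed columns, in the order given
kept <- toy %>% select(wages, sex)

# Keep everything except the listed columns
dropped <- toy %>% select(-ppsort, -pr)

names(kept)
names(dropped)
```

Both calls leave the same two variables here, but the keep form also reorders the columns while the drop form preserves their original order.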

This is very useful and is usually done for practical reasons such as memory. Cleaning data sets to remove unessential information also allows us to focus our analysis and makes it easier to answer our desired research question. In our specific case, we want to keep data on just wages and sex. We will use the select function for this.

In [None]:
head(census_data %>% select(wages, sex))

As seen above, this function allows us to look at the relationship between sex and wages more directly in our raw data. To finish finding the wage gap among British Columbians (keep in mind we have filtered out those who do not fit `pr == "british columbia"`), we can then invoke the `group_by()` and `summarize()` functions we have learned in past notebooks.

In [None]:
# creating our more focused dataframe
answer <- census_data %>%
    mutate(sex = as_factor(sex)) %>% 
    filter(!is.na(wages)) %>%     # drop observations with missing wages
    group_by(sex) %>%
    summarize(average_wage = mean(wages))

answer

Using the `group_by` and `summarize` tools we learned in previous lessons has allowed us to address our research question. We find that the gender pay gap between male and female British Columbians is roughly $23,000.
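
Missing wages need care in this calculation, because `mean()` returns `NA` if any value in a group is missing. The sketch below, on a made-up data frame, shows two equivalent ways of handling this and then computes the gap as a difference of group means. All names and values are illustrative assumptions and do not reproduce the census result.

```r
library(dplyr)

toy <- data.frame(
  sex   = c("female", "female", "male", "male", "male"),
  wages = c(30000, 50000, 40000, NA, 70000)
)

# Option 1: drop missing wages first, then average within each group
by_sex <- toy %>%
  filter(!is.na(wages)) %>%
  group_by(sex) %>%
  summarize(average_wage = mean(wages))

# Option 2: keep all rows and ask mean() to skip missing values
by_sex_narm <- toy %>%
  group_by(sex) %>%
  summarize(average_wage = mean(wages, na.rm = TRUE))

# The gap is the difference between the two group means
gap <- by_sex$average_wage[by_sex$sex == "male"] -
       by_sex$average_wage[by_sex$sex == "female"]
gap
```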

<span style="color:#CC7A00" > 🔎 **Let's think critically** </span>
> 🟠 The place name “British Columbia” was adopted in 1858, when the mainland of this region was made a British colony, and was kept when the region became an official province of the larger colonial project of “Canada” in 1871. This name, [having roots in European conquest _(British)_ and the Spanish-sponsored “explorer” Columbus _(Columbia)_](https://thewalrus.ca/rename-british-columbia/), has only been used to refer to this region for around 160 years. For thousands of years before then, this region was known by many other names in many different languages by over 200 unique First Nations, each with their own distinct lifeways and traditions.  If you are not Indigenous to the place you are currently located, we invite you to reflect on the following:\
> 🟠 Where are you currently situated? You can use [this interactive map](https://native-land.ca/) to see what Indigenous lands you are on.
>
> 🟠 The place names used in the 2016 Canadian census data reflect a particular, politically-involved way of organizing and understanding space and consequently enable particular kinds of economic research questions (e.g., how does average income in BC compare to average income in Alberta?). What kinds of questions can the Canadian census data _not_ answer?\
> 🟠 What kinds of economic research questions might we be able to form if given population data that was spatially arranged in a fashion resembling the [Native Land Digital Map?](https://native-land.ca/)\
> 🟠 If you are located in what is currently referred to as “British Columbia”, how can you learn about economic practices that are local to this area? The province of BC's [Indigenous Economic Development website](https://www2.gov.bc.ca/gov/content/employment-business/economic-development/bc-ideas-exchange/success-stories/indigenous-economic-development) has a list of just _some_ of the ways that local Indigenous groups are applying their own economic traditions today.

#### Exercise: Using Select and Filter

Use `census_data` to create a data frame which only shows the highest education level `hdgree` for British Columbians with 100,000 CAD or more in wages. Your data frame should have a single column named `hdgree`. Store your answer in the object `answer5`.

In [None]:
answer5 <-

test_5()

## Conclusion
In this notebook, we have covered the basic process of working with data. You should now be familiar with how to load in data, how to define and redefine variables, how to drop missing observations, and how to add in new data and drop existing data to meet your research purposes. This general scheme is critical to any research project, so it is important to keep in mind as you progress throughout your undergraduate economics courses and beyond.