# Chapter 4 - Organizing Data

## 4.1 The structure of data 

Data can come to us in many forms. If you collect data yourself, you may start out with numbers written on scraps of paper. Or you may get a computer file filled with numbers and words of various sorts, representing many variables at once. 

So far we have worked with one variable at a time, which is easy to visualize and understand simply by printing its vector. However, most real-world data has many variables stored at once. To manage this, it is necessary to organize and format data so that they are easy to analyze using statistical software. There is no one way to organize data, but there is a way that is most common, and that is what we recommend you use.

This common way is called a **dataframe**. Under this framework, data are stored in rectangular tables, with rows and columns. What goes in each row and column follows four principles:

- Each column is a variable
- Each row is an observation (what we have been calling a case, or an object to which a measure is attached)
- The first row contains the names of all the variables
- The data within a dataframe are from the same dataset

Rectangular tables of this sort are represented in R using a new data type, conveniently called a dataframe. The columns are the variables; this is where the results of measures are kept. The rows are the cases sampled. All the values on the same row, across the different variables, belong to the same observation (e.g., the same person, zip code, habitat, etc.). Dataframes provide a way to save information such as column headings (i.e., variable names) in the same table as the actual data values.

Principle 4 above simply states that the types of observations that form the rows cannot be mixed within a single table. So, for example, you wouldn’t have rows of college students intermixed with rows of cars or countries or couples. If you have a mix of observation types (e.g., students, families, countries), they each go in a different dataframe.

## 4.2 Creating a dataframe

### Combining vectors
Let's say we have two vectors, ```x1 <- c(1,2,3,4,5)``` and ```x2 <- c(18,21,20,23,20)```. We can combine these together into a dataframe using the ```data.frame()``` function.

In [None]:
x1 <- c(1,2,3,4,5)
x2 <- c(18,21,20,23,20)
data.frame(x1=x1, x2=x2)

This function takes as arguments all the vectors you want to combine, and lines them up together as columns in the order you typed them in. This way, you imply that the second item in ```x1``` and the second item in ```x2``` belong to the same observation, since they appear on the same row in the dataframe. This function also lets you name the variables; let's rewrite the code above to give ```x1``` and ```x2``` more meaningful variable names.

In [None]:
data.frame(user_id=x1, age=x2)

Using this approach, you can form a dataframe out of any set of variables - well, almost. Try out the code below, and see if you can find why it causes an error. What can we do to fix it? 

In [None]:
x1 <- c(1,2,3,4,5)
x2 <- c(18,21,20,23)
data.frame(x1=x1, x2=x2)

Since a dataframe is a rectangle, vectors need to be the same length in order to be combined.
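
One possible repair is sketched below, assuming the missing fifth value is simply unknown: pad the shorter vector with ```NA``` so both vectors have the same length. The column names ```user_id``` and ```age``` are only illustrative, and ```tryCatch()``` is used here just to capture the error message instead of stopping the code.

```r
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(18, 21, 20, 23)   # one value short

# data.frame() stops with an error when the lengths differ;
# tryCatch() captures the error message so the code keeps running
msg <- tryCatch(
  data.frame(x1 = x1, x2 = x2),
  error = function(e) conditionMessage(e)
)
print(msg)

# Pad the shorter vector with NA so both vectors have 5 items
x2_padded <- c(x2, NA)
df <- data.frame(user_id = x1, age = x2_padded)
print(df)
```

Whether padding with ```NA``` is appropriate depends on why the value is missing; if the fifth observation should not exist at all, dropping it from ```x1``` would be the other option.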

### Importing existing data

The vast majority of the time in data analysis, you are not manually creating vectors and dataframes. Instead, you are importing an existing file of data. So for the rest of this chapter we're going to practice working with and understanding an example dataset from a study by Crum & Langer (2007) called [Mindset Matters](https://search.r-project.org/CRAN/refmans/Lock5Data/html/MindsetMatters.html), on how the placebo effect works with exercise. In this study, 75 hotel housekeepers were either told that their daily work accounted for all the daily exercise they need, or were told nothing. 

You may have experience opening datasets in programs like Excel with an .xlsx extension. The extension tells you what kind of file format it is - .xlsx, for example, is an Excel file. There are many file formats data can be saved in, and some are easier than others to open in various software. Our Mindset Matters data is saved as a .csv, meaning a "comma-separated value" format, as this is really easy for computer programs to parse. If you were to open this file in a basic text editor program, you'd see rows are separated on different lines, and columns are separated by commas. In R, the ```readr``` package can understand this format and create a rectangular dataframe for us. 

```readr``` is not a default package in R - it has to be downloaded the first time you use it, and then loaded every time you start a new R session. Edit the code below to install the ```readr``` package and load it into your R environment.

In [None]:
# finish the code below to install readr
install.packages("   ")

#finish the code below to load readr
library(    )

Before we can load our data, R has to know where on your computer it is located. By default, it only looks in what is called your **working directory** - a specific folder on your computer where your code also lives. You can use the function ```getwd()``` to learn what your current working directory is.

In [None]:
getwd()

Because this book is hosted on a website, the above code should show you the directory where these web files are stored. On your own computer, ```getwd()``` will return something different. 

We've stored the ```mindsetmatters.csv``` file in the same directory as this code, so R will be able to load it if you simply use the function ```read_csv()``` from the ```readr``` package. Try it out below - the only argument you need for the function is the name of the file, in quotes.

In [None]:
read_csv("mindsetmatters.csv")

If the data lived in a different directory, for example in a folder called "datasets", you'd have to include that in the filename for R to be able to find it. I.e., ```read_csv("datasets/mindsetmatters.csv")```.
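
To see the folder-plus-filename idea in action without touching the real data, here is a minimal sketch. It assumes ```readr``` is installed, and it creates a made-up file called ```toy.csv``` inside a temporary "datasets" folder; both names are hypothetical and exist only for this illustration.

```r
library(readr)

# Hypothetical setup: a temporary "datasets" folder holding a tiny CSV file
data_dir <- file.path(tempdir(), "datasets")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "toy.csv")
writeLines(c("id,age", "1,18", "2,21", "3,20"), csv_path)

# file.path() joins folder and file names with the correct separator,
# and read_csv() uses the first line of the file as the variable names
toy <- read_csv(csv_path)
print(toy)
```

Building the path with ```file.path()``` rather than typing the slashes by hand is one way to keep the same code working across operating systems.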

## 4.3 Looking at data

The first thing to do when opening a dataset for the first time is to understand it - what each variable is, how many data points, what the variable data types are, etc. 

After you run the code above, you should see a big printout of the whole dataframe. You might be thinking to yourself, “Wow, that’s a lot to take in!” In fact, this would be considered a small dataset! Real data can have hundreds or thousands of datapoints stored together, with hundreds or thousands of variables recorded. It can get unwieldy to look at a whole dataset at once. Thus it's usually best to look at a summary of the data in a dataframe, or a sampling of it. 

We can use certain functions to do this summarizing for us. But first, we have to make sure we save our dataframe so we can keep using it over and over. Remember that we have a new data type, dataframe, which means it can be saved as an object itself. In the code below, read the dataframe into an object called ```mindsetmatters```.

In [None]:
# read the dataframe into an object called mindsetmatters
mindsetmatters <- # your code here

# check the data type of this object
str(mindsetmatters)

A cool thing about the ```str()``` function is that, when you're working with a complex object like a dataframe that contains many values, it'll tell you the type of the object as a whole as well as the type of every variable within it. Look at the output above - what data type is the variable "Age"? Can you find where it tells you the size of the dataframe? 

Using ```str()``` lets you summarize the information within a dataframe without opening the whole thing to view. Another way to do it, if you want to see how the dataframe is arranged, is to use the function ```head()```. This will show you just the first 6 rows of the dataframe.

In [None]:
#run this code to see the head of the dataframe
head(mindsetmatters)
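
A few other base R functions give quick summaries of a dataframe's size and layout. The sketch below uses a small made-up dataframe called ```toy``` (not the real data) so that the output is easy to check by eye.

```r
# A made-up dataframe with 8 observations and 2 variables
toy <- data.frame(
  id  = 1:8,
  age = c(18, 21, 20, 23, 20, 35, 41, 29)
)

dim(toy)          # number of rows, then number of columns
nrow(toy)         # number of rows only
ncol(toy)         # number of columns only
names(toy)        # the variable names
head(toy, n = 3)  # the first 3 rows instead of the default 6
tail(toy, n = 2)  # the last 2 rows
```

The ```n``` argument of ```head()``` and ```tail()``` is handy when six rows is more or less than you want to see.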

## 4.4 Accessing variables

Oftentimes we want to reference a specific variable within a dataframe. So long as a datafile was saved with the first line having the names of each variable, we can access any variable by name using the ```$``` symbol in R. If you want to specify the ```Age``` variable in the ```mindsetmatters``` dataframe, for example, you would write ```mindsetmatters$Age```. 

Try using the ```$``` symbol to print out just the variable ```Age``` from ```mindsetmatters```.

In [None]:
# Use the $ sign to print out the contents of the Age variable in the mindsetmatters dataframe


When R is asked to print out a single variable (such as ```Age```), R prints out each person’s value on the variable all in a row. You can then work with this variable as if it were a single vector, like we used in the previous chapter. Try "aging up" everyone's age by adding two years to the ```Age``` variable. (Remember that adding a constant to a vector adds that value to every item in the vector).

In [None]:
# Add two years to the value of everyone's age in the variable Age, and print out the result


Usually you want to access variables this way - it has more meaning to humans reading code. But just so you know, you can also access variables using index notation with ```[]``` brackets. However, a dataframe has two dimensions (rows and columns), so to access data using this notation we have to give a number for both which row we want and which column we want. E.g., ```[row_num, col_num]```. 

If we want to print out the entirety of the third row from mindsetmatters, we'd use ```mindsetmatters[3,]```. Leaving nothing after the comma for column number allows R to return every value in the third row regardless of column. Likewise, returning the entirety of the third *column* from mindsetmatters would look like ```mindsetmatters[,3]```. Try it out in the code below. 

In [None]:
#Write code to return the 10th row of mindsetmatters

#Write code to return the 5th column of mindsetmatters

#What do you think this code will return?
mindsetmatters[10,5]
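
The same bracket notation is sketched below on a small made-up dataframe called ```toy```, so you can compare each result against the full table. The column names are illustrative only.

```r
# A made-up dataframe with 4 observations and 3 variables
toy <- data.frame(
  id    = 1:4,
  age   = c(18, 21, 20, 23),
  group = c("a", "b", "a", "b")
)

toy[2, ]                  # the entire second row
toy[, 2]                  # the entire second column
toy[2, 2]                 # a single cell: row 2, column 2
toy[2, "age"]             # the same cell, naming the column instead
toy[1:2, c("id", "age")]  # rows 1 to 2 of two named columns
```

One difference to be aware of: with a plain ```data.frame()``` like this one, ```toy[, 2]``` returns a vector, whereas data read in with ```read_csv()``` is stored as a "tibble", for which the same expression returns a one-column table.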

## 4.5 Subsetting data

Sometimes you want to focus on a subset of your variables in a data frame. For example, you might care about just the variables ```Age``` and ```Wt``` in the ```mindsetmatters``` dataframe. The output would be easier to read if it only included these variables and not all the others.

We can use the ```select()``` function from the ```dplyr``` package to look at just a subset of variables. ```dplyr``` is a very commonly-used package that makes working with dataframes easy (click [here](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for a cheat sheet of many ```dplyr``` functions). When using ```select()```, we first need to tell R which dataframe, then which variables to select from that dataframe.

In [None]:
install.packages("dplyr")
library(dplyr)

#Make a prediction, what do you think this code will do? 
select(mindsetmatters, BMI, Fat)

You may need to scroll the output up and down to see it all. It’s quite a lot because the function ```select()``` will print out all the values of the selected variables. What the ```select()``` function actually does is return a new data frame with the selected subset of columns.

If you want to look at just a few rows of a few variables, we can combine ```head()``` and ```select()``` together in a nested function.

In [None]:
head(select(mindsetmatters, BMI, Fat))

Nesting functions allows us to do multiple operations on one line. Nested functions always evaluate from the inside out - ```head()``` is applied to the result of ```select()```. In this case, a new dataframe is made by ```select()```, and then ```head()``` returns the first six rows of it. Technically, you can nest as many functions as you want! But it does get harder to read the deeper you go, so consider what your future self will appreciate when reading the code again later. 
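
If a nested call starts to feel hard to read, one alternative is to save the intermediate result in its own object. The sketch below, which assumes ```dplyr``` is installed and uses a made-up dataframe called ```toy```, shows that both approaches produce the same result.

```r
library(dplyr)

# A made-up dataframe with 8 observations and 3 variables
toy <- data.frame(
  id    = 1:8,
  age   = c(18, 21, 20, 23, 20, 35, 41, 29),
  score = c(3, 5, 4, 2, 5, 1, 4, 3)
)

# Nested version: select() runs first, then head() runs on its result
nested <- head(select(toy, id, age), n = 3)

# Step-by-step version: save the intermediate dataframe, then use it
two_cols <- select(toy, id, age)
stepwise <- head(two_cols, n = 3)

print(nested)
print(stepwise)
```

The step-by-step version takes more lines, but each line does only one thing, which can make mistakes easier to spot.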

The ```select()``` function lets us look at a subset of variables. But sometimes you might want to look at a subset of observations. Notice the first person in the ```mindsetmatters``` dataframe has an age of 43. Are they the only person of that age? What happens if we put 43 as an argument in ```select()```?

In [None]:
select(mindsetmatters, 43)

```select()``` only works on *variables* (or columns of the data frame). In this case, since you gave it the number 43 for the second argument, it thought you were asking for the 43rd column - but there aren't 43 columns in this dataset! 

To get a subset of observations (or rows of the data frame) we use a different function: ```filter()```. This function filters the dataframe to show only those observations (rows) that match some criteria. For example, here is the code that will return only the observations where the age is 43:

In [None]:
filter(mindsetmatters, Age==43)

The function ```filter()```, like ```select()```, returns a dataframe. In this case, the data frame only has one row because only one person is exactly 43 years old. 

In [None]:
#How would you use filter() to return a dataframe of everyone who is *at least* 43 years old?


One challenge for students is to keep track of the difference between an observation (e.g., housekeepers, represented in rows), a variable (e.g., ```Age``` or ```Wt```, represented in columns), and the values a variable can take (e.g., 43, represented in cells). It is helpful to imagine the rows and columns of a data frame when you read about observations and variables, respectively. If the data are in a dataframe format like this (also known as "tidy"), the rows will always be observations and the columns, variables.

In this course we will be providing most of the data you analyze in a tidy format. However, the world is not always tidy. One day, in the wild world outside of this course, you may have to transform a non-tidy data set into a tidy one.

## 4.6 Manipulating data

Once data are in a tidy format, we can use R commands to manipulate the data in various ways. Let’s look at a few common things you might need to do before analyzing your data:

- Identifying Missing Data
- Filtering Data
- Creating New Variables
- Recoding Variables

### Identifying missing data

Sometimes (in fact, usually) we end up with some missing data in our data set. People don't always answer every question on a survey, someone didn't show up to their third doctor's appointment, etc. R represents missing data with the value NA (not available). It will also recode a blank cell in a file as NA. If your data set represents missing data in some other way (e.g., some people put the value -999), you should recode the values as NA when working in R.  
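
As a minimal sketch of that recoding step, suppose a made-up dataframe called ```toy``` uses -999 to mark missing weights. The names ```weight``` and ```weight_clean``` are hypothetical; the cleaned values are stored in a new column so the original is left untouched.

```r
# A made-up dataframe where -999 stands for "missing"
toy <- data.frame(
  id     = 1:5,
  weight = c(150, -999, 142, 180, -999)
)

# Copy the variable, then replace every -999 in the copy with NA
toy$weight_clean <- toy$weight
toy$weight_clean[toy$weight_clean == -999] <- NA
print(toy)

# Leaving -999 in place would badly distort a calculation like the mean
mean(toy$weight)
mean(toy$weight_clean, na.rm = TRUE)  # na.rm = TRUE skips the NA values
```

Comparing the two means shows why a placeholder like -999 should not be left in a numeric variable.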

Let’s consider the ```Fat2``` variable in the ```mindsetmatters``` dataframe. First, we'll reshuffle the order of rows in the dataframe based on ```Fat2```, so we can better see the missing data. To do this, we'll use the ```arrange()``` function from ```dplyr```. 

In [None]:
# we've already loaded dplyr into R in a previous line of this chapter, so we don't need to
# do it again

mindsetmatters <- arrange(mindsetmatters, Fat2) #remember variable names are case sensitive!
print(mindsetmatters$Fat2)

Based on this, we can easily see some missing data in this variable. We can also count how many observations have this value:

In [None]:
# the function count() in dplyr will count how many observations in a dataframe have a specific value
count(mindsetmatters, Fat2)
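
If all you want is the number of missing values, another option is the base R function ```is.na()```, sketched below on a made-up dataframe called ```toy```. It returns ```TRUE``` for each missing value, and because ```TRUE``` counts as 1 when summed, ```sum()``` gives the number of missing values.

```r
# A made-up dataframe with two missing values in the fat variable
toy <- data.frame(
  id  = 1:6,
  fat = c(31.2, NA, 28.5, NA, 35.0, 29.9)
)

is.na(toy$fat)       # TRUE wherever a value is missing
sum(is.na(toy$fat))  # how many values are missing in one variable
colSums(is.na(toy))  # how many values are missing in every variable
```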

In this dataframe, ```Fat``` is the percentage of a participant's body fat at the beginning of the study, and ```Fat2``` is the percentage at the end of the study. Use the code window below to explore whether there are any missing values for ```Fat```. Why do you think ```Fat2``` has more missing values?

In [None]:
#Write your code here


Now that you can find missing data, the big question is - what to do about it? There is a whole area of study in quantitative psychology on what to do with missing data. This is an advanced topic, so we won't get into it in this course. Instead, in future lessons we'll tell you what actions to take for missing data, but just know that there are different opinions out there about what the *best* actions are. 

### Filtering data

One of these options is to remove all the observations from a dataset that are missing data on a variable you care about. We can use the ```filter()``` function, introduced previously, to remove observations with missing data from a data frame. For example:

In [None]:
filter(mindsetmatters, Fat2 != "NA")

Can you remember what the ```!=``` operator does? This code returns a dataframe that includes only cases for which the variable ```Fat2``` is *not* equal to NA. Note that the ```filter()``` function filters in, not out (i.e., keeps).
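
A note on why the comparison above works: any comparison involving ```NA``` evaluates to ```NA``` rather than ```TRUE```, and ```filter()``` keeps only rows where the condition is ```TRUE```. A more explicit way to say "not missing" is ```!is.na()```. The sketch below assumes ```dplyr``` is installed and uses a made-up dataframe called ```toy```.

```r
library(dplyr)

# A made-up dataframe with two missing values in the fat variable
toy <- data.frame(
  id  = 1:5,
  fat = c(31.2, NA, 28.5, NA, 35.0)
)

# Comparing a missing value to anything gives NA, not TRUE or FALSE
NA != "NA"

# is.na() is TRUE for missing values, so !is.na() is TRUE for present ones
complete <- filter(toy, !is.na(fat))
print(complete)
```

Both forms return the same rows here; ```!is.na()``` simply states the intent directly.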

As with anything in R, your filtered dataframe is only temporary unless you save it to an R object. So save the data with no missing ```Fat2``` values in a new data frame called ```mindsetmatters_subset```.

In [None]:
#Write code here to make your new dataframe mindsetmatters_subset without any NA in Fat2


What do you think would happen if we saved ```filter(mindsetmatters, Fat2 != "NA")``` to ```mindsetmatters```? If we save the filtered data to the same dataframe name, it will overwrite the original dataframe. Doing this would mean we can no longer access the data we filtered out. This process is called **destructive editing**. In general, we don't want to do this - we don't want to permanently erase any data. Thus if you're manipulating data in a dataframe, it's usually best to save the manipulated data as a new object. That way you can always access the original data again if needed. 

### Creating new variables 

Often we use multiple measures of a single attribute because no single measure would be adequate. For instance, it would be difficult to measure school achievement with a grade score from just your English class. However, if you do have multiple measures, you probably will want to combine them into a single variable. In the case of school achievement, a good summary measure might be the average grade earned across all of a student’s courses.
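
To make the school achievement example concrete, here is a minimal sketch with made-up grades. The dataframe ```grades``` and its column names are hypothetical; ```rowMeans()``` is a base R function that averages across columns for each row.

```r
# Made-up grades for three students in three courses
grades <- data.frame(
  student = c("A", "B", "C"),
  english = c(90, 72, 85),
  math    = c(80, 88, 95),
  history = c(85, 80, 90)
)

# Average the three course grades within each row (each student)
grades$average <- rowMeans(grades[, c("english", "math", "history")])
print(grades)
```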

It is quite common to create new variables that mutate values from other variables. For example, in ```mindsetmatters```, we have a measurement for both someone's weight (```Wt```) and their BMI score (```BMI```). We don't know their height, but BMI is computed from weight and height (weight divided by height squared). So, dividing ```Wt``` by ```BMI``` gives a value that grows with height (strictly, it is proportional to height squared), and we can store it in a new variable we'll call ```Ht```. 

Making a new variable is easy. Simply assign the computed vector to a new variable name using the ```$``` symbol after the dataframe name. E.g., ```mindsetmatters$Ht <- mindsetmatters$Wt / mindsetmatters$BMI```. If the variable ```Ht``` already exists in the dataframe, R will overwrite it with these new values. If it doesn't, R will create a new variable and append it as a new column to the end of ```mindsetmatters```.

In [None]:
#make a new dataframe separate from the original
mindsetmatters2 <- mindsetmatters

#Look at all the variables in mindsetmatters
head(mindsetmatters)

# Create a new variable in mindsetmatters2 called Ht from Wt and BMI
mindsetmatters2$Ht <- mindsetmatters2$Wt / mindsetmatters2$BMI

#Look at the updated dataframe, and find where the new variable went
head(mindsetmatters2)
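
As a side note, BMI is conventionally defined as weight in kilograms divided by the square of height in meters, so recovering an actual height takes a square root. The sketch below uses made-up values and assumes the weights are already in kilograms; the weight variable in a real dataset may use other units, so check its documentation before applying this.

```r
# Made-up values; weight is assumed to be in kilograms
toy <- data.frame(
  weight_kg = c(70, 82),
  bmi       = c(24.2, 26.8)
)

# BMI = weight / height^2, so height = sqrt(weight / BMI)
toy$height_m <- sqrt(toy$weight_kg / toy$bmi)
print(toy)
```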

Whenever you make new variables, or even do anything else in R, it’s a good idea to check to make sure R did what you intended it to do. You can use the ```head()``` function for this. 

Sometimes you also might want to create a new variable that represents the data in a different way. For example, maybe in one analysis we want to look at only housekeepers that are at least 40 years old. You can make a new boolean variable that is ```TRUE``` for everyone at least 40 years old, and ```FALSE``` for everyone younger.

In [None]:
# making a new boolean variable to mark who is at least 40 years old
mindsetmatters2$Older <- mindsetmatters2$Age >= 40
head(mindsetmatters2)

In [None]:
# Use count() to find out how many people are at least 40 years old 

# Why are there three possible data values in the output, instead of just true and false?

In [None]:
# Write some more code below, using filter(), to make a mindsetmatters_subset dataframe of only
# participants who are at least 40 years old. Does the size of this dataframe match what you 
# expect from the code above?


Another way of creating variables, using a function instead of the ```$``` operator, is the ```mutate()``` function from the ```dplyr``` package. When you use ```mutate()```, you typically need to specify 3 things:

- the dataframe you want to modify
- the name of the new variable that you’ll create
- the value you will assign to the new variable

Here is code that will make the same ```Ht``` variable we created earlier:

In [None]:
# start from mindsetmatters2 so the variables we already added to it are kept
mindsetmatters2 <- mutate(mindsetmatters2, Ht = Wt/BMI)
head(mindsetmatters2)

```mutate()``` – like all of the ```dplyr``` functions – strictly operates on dataframes. It’s not set up to work with lists, matrices, vectors, or other data structures. Thus this code is changing a dataset ("mutating" it) to have a new variable. The first argument is the dataframe to mutate, and the second argument is called a "name-value pair". ```Ht``` is not a pre-defined argument in the ```mutate()``` function, but is a "name" you are choosing to which you assign a new "value." 
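
```mutate()``` also accepts several name-value pairs in one call, separated by commas. The sketch below assumes ```dplyr``` is installed and uses a made-up dataframe called ```toy```; the new variable names ```age_next_year``` and ```older``` are only illustrative.

```r
library(dplyr)

# A made-up dataframe with 4 observations
toy <- data.frame(
  id     = 1:4,
  age    = c(35, 52, 41, 28),
  weight = c(150, 172, 140, 165)
)

# Two name-value pairs: each one becomes a new column
toy2 <- mutate(toy,
  age_next_year = age + 1,
  older         = age >= 40
)
print(toy2)

# The original dataframe is unchanged because the result went to toy2
names(toy)
```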

### Recoding variables

There are some instances where you may want to change the way a variable is coded. For instance, the variable ```Cond``` is coded 1 if a housekeeper was in the experimental condition (told that they are getting enough exercise), or 0 if they were in the control condition (not told anything). Recall the ```factor()``` function we learned last chapter, and turn this variable into a factor in ```mindsetmatters2```.

In [None]:
#Identify the data type of Cond
str(mindsetmatters2)

#Recode Cond into a factor, and save it over the original Cond variable


#Identify the data type of Cond again. Did it change?
str(mindsetmatters2)

Remember that some functions take multiple arguments (such as the dataframe and the variable name), while others take just one argument (the variable itself, accessed with ```$```). This means you'll have to remember how functions work, or keep a cheatsheet you can refer back to. 

What if you don't want to change only the data type, but also the values in a variable? We learned last chapter that the function ```factor()``` can do this with its ```levels``` and ```labels``` arguments. Another way to do it is using the function ```recode()``` like this:

```recode(mindsetmatters2$Cond, "1"="exp", "0"="control")```

In R, there are often multiple ways to do the same thing! In the ```recode()``` function, you need to put the old value in quotes; the new value could be in quotes (if a character value) or not (if numerical). This function also uses the ```=``` assignment operator instead of ```<-```. This is just for telling R a momentary command inside an argument - you're not permanently assigning 1 to have the value of "exp". Keep using ```<-``` for assigning objects that stick around.
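
For comparison, here is a minimal sketch of the ```factor()``` route with ```levels``` and ```labels```, using a short made-up vector called ```cond``` rather than the real ```Cond``` variable.

```r
# A made-up vector coded 1 (experimental) and 0 (control)
cond <- c(1, 0, 0, 1, 1)

# levels lists the old values; labels gives the matching new names, in the same order
cond_factor <- factor(cond, levels = c(0, 1), labels = c("control", "exp"))
print(cond_factor)

# table() counts how many observations fall in each category
table(cond_factor)
```

The order matters here: the first label is paired with the first level, the second label with the second level, and so on.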

As always, whenever we do anything, we might want to save it. Try saving the recoded version of ```Cond``` as ```CondRecode```, a new variable in ```mindsetmatters2```. Print a few observations of ```Cond``` and ```CondRecode``` to check that your recode worked.


In [None]:
# Save the recoded version of `Cond` to `CondRecode`
mindsetmatters2$CondRecode <- recode(     )

# Write code to print the first 6 observations of `Cond` and `CondRecode` only



## Chapter summary

After reading this chapter, you should be able to:

- Load a dataset into R
- Explain what the rows and columns of a dataframe mean
- Access particular variables in a dataframe
- Use select() and filter() to subset a dataframe
- Identify missing data
- Create and add a new variable to a dataframe
- Recode a variable
- Explain why we don't want to do destructive editing

## Chapter quiz

Questions from 2.7 coursekata