# DS102 Statistical Programming in R : Lesson - Data Frames

### Table of Contents <a class="anchor" id="DS102L6_toc"></a>

* [Table of Contents](#DS102L6_toc)
    * [Page 1 - Introduction](#DS102L6_page_1)
    * [Page 2 - Creating and Viewing Data Frames](#DS102L6_page_2)
    * [Page 3 - Built-in Data Frames](#DS102L6_page_3)
    * [Page 4 - Importing CSV Data](#DS102L6_page_4)
    * [Page 5 - Importing MS Excel Files](#DS102L6_page_5)
    * [Page 6 - Manipulating Data](#DS102L6_page_6)
    * [Page 7 - Filtering Data](#DS102L6_page_7)
    * [Page 8 - Ordering Data](#DS102L6_page_8)
    * [Page 9 - Selecting and Mutating Data](#DS102L6_page_9)
    * [Page 10 - Grouping and Summarizing Data](#DS102L6_page_10)
    * [Page 11 - Graphing Data Grouped by Factors](#DS102L6_page_11)
    * [Page 12 - Key Terms](#DS102L6_page_12)
    * [Page 13 - Hands-On](#DS102L6_page_13)
    * [Page 14 - Hands-On Solution](#DS102L6_page_14)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS102L6_page_1"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [2]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('247057946', width=720, height=480)

A data frame is a very powerful way of representing a set of data in R; it is the fundamental type of object upon which many of R's analysis tools work. You saw previously that the ```ggplot2``` package uses data frames. In this lesson, you will use data frames to manipulate data and select observations.

Conceptually, a data frame can be thought of as a table of data. Typically, each column in the data frame represents a variable. Each row represents an observation that has values for some or all of the columns.

Note that the term *variable* gets used in two ways: 

1. A column name in a data frame. 
2. A storage location. 

Typically, context will help you deduce which type of variable is meant. Nine times out of ten, from here on out, the first one will be used!

In this lesson, you will learn how to: 

* Create a data frame
* Access built-in data frames
* Import data frames from ```.csv``` and ```.xlsx``` files
* Manipulate (wrangle) data using the ```dplyr``` package 

For the hands-on, you will access one of R's built-in data frames to summarize and graph data about cars.


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Creating and Viewing Data Frames<a class="anchor" id="DS102L6_page_2"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [3]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327990142', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L06-pg2tutorial.zip)**.

# Creating and Viewing Data Frames

To begin creating a data frame, suppose you collect information on four of your friends and organize it into the following table:

<table class="table table-striped">
    <tr>
        <th>Name</th>
        <th>Age</th>
        <th>Dominant Hand</th>
    </tr>
    <tr>
        <td>Bob</td>
        <td>36</td>
        <td>Right</td>
    </tr>
    <tr>
        <td>Nancy</td>
        <td>31</td>
        <td>Right</td>
    </tr>
    <tr>
        <td>Cyrus</td>
        <td>26</td>
        <td>Left</td>
    </tr>
    <tr>
        <td>Jackie</td>
        <td>34</td>
        <td>Right</td>
    </tr>
</table>

In this table, the top row consists of a title for each column. This is sometimes called a *header row*. The header row indicates what variables will be in the data frame. Each subsequent row represents three pieces of information about a particular friend: their name, their age, and whether they are right- or left-handed. Each row represents an *observation*.

You can create a data frame that represents this table with the following commands. You first make vectors of the contents of the columns with the ```c()``` function; you store each vector in a variable with the column name. You then use these vectors as arguments to the ```data.frame()``` function.

```{r}
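# Each vector holds the values for one column
Name <- c("Bob", "Nancy", "Cyrus", "Jackie")
Age <- c(36, 31, 26, 34)
Dominant_Hand <- c("Right", "Right", "Left", "Right")
# stringsAsFactors = TRUE stores text columns as factors; R 4.0.0 and later default to FALSE
friends <- data.frame(Name, Age, Dominant_Hand, stringsAsFactors = TRUE)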
```
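
Once the data frame exists, a few base R functions give a quick check of its shape and column types. The sketch below rebuilds ```friends``` so that it runs on its own; ```stringsAsFactors = TRUE``` is included on the assumption that you want the text columns stored as factors, which is what the ```Levels:``` lines printed later on this page reflect.

```{r}
# Rebuild the example data frame so this block is self-contained
friends <- data.frame(
  Name = c("Bob", "Nancy", "Cyrus", "Jackie"),
  Age = c(36, 31, 26, 34),
  Dominant_Hand = c("Right", "Right", "Left", "Right"),
  stringsAsFactors = TRUE  # assumption: store text columns as factors
)

nrow(friends)   # number of observations (rows)
ncol(friends)   # number of variables (columns)
names(friends)  # column names
str(friends)    # compact summary of each column's type and first values
```

If ```str()``` reports a column as ```chr``` rather than ```Factor```, the text was stored as plain character strings.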

---

## Viewing your Data Frame

There are a few different ways to view the data frame you have just created.  

---

### 1. In the Console, in its Entirety

After creating the data frame, you can type the name of the data frame, ```friends```, to see the contents of the data frame you have created:

```{r}
friends
```

        Name Age Dominant_Hand
    1    Bob  36         Right
    2  Nancy  31         Right
    3  Cyrus  26          Left
    4 Jackie  34         Right

---

### 2. In the Console, by Variable

You can see the data in each column by using the ```$``` syntax. For example,

```{r}
friends$Age
```

[1] 36 31 26 34

```{r}
friends$Name
```

[1] Bob Nancy Cyrus Jackie

Levels: Bob Cyrus Jackie Nancy

```{r}
friends$Dominant_Hand
```

[1] Right Right Left Right

Levels: Left Right

---

### 3. Accessing Individual Elements of the Data Frame

Elements of the data frame can be accessed in many ways. You can use array index notation to access an element. For example, if you want to access the element in the third row and the second column, you could type:

```{r}
friends[3,2]
```

[1] 26

---
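
Square brackets can return more than a single element. As a small illustration (rebuilding ```friends``` so the block runs on its own), leaving the row or column position empty returns an entire column or row, and a column name in quotes can stand in for its number.

```{r}
# Rebuild the example data frame so this block is self-contained
friends <- data.frame(
  Name = c("Bob", "Nancy", "Cyrus", "Jackie"),
  Age = c(36, 31, 26, 34),
  Dominant_Hand = c("Right", "Right", "Left", "Right")
)

friends[3, ]                    # entire third row (all columns for Cyrus)
friends[, 2]                    # entire second column, returned as a vector
friends[3, "Age"]               # third row of the Age column, selected by name
friends[1:2, c("Name", "Age")]  # first two rows of two named columns
```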

### 4. In the View Pane

Once you have created the data frame, you can also click on it in the Environment pane, located on the right-hand side of RStudio. Clicking once opens the data frame in an easy-to-read grid that shows all of your rows and columns. If the data does not fit in the view pane, you can scroll to see the rest. Here's what the view pane data frame looks like:

![The data frame in R Studio showing data points and columns. The title on the tab reads friends. Column headings are name, age, dominant hand. Row one, Bob, thirty six, right. Row two, Nancy, thirty one, right. Row three, Cyrus, twenty six, left. Row four, Jackie, thirty four, right.](Media/dataFrames1.png)

You can also have the view pane automatically pop up, without needing to click on it in the Environment pane, with the ```View()``` function. For example:

```{r}
View(friends)
```

will also display the above. 

---

## Levels and Factors

What is a *factor*? In short, a factor is R's word for a categorical variable. *Levels* are the different categories that the factor can contain.

So the ```friends$Name``` factor can have four levels, namely the four names of your friends. The ```friends$Dominant_Hand``` factor can have two levels: Right or Left.
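
To see levels in action, the hedged sketch below builds a factor directly with ```factor()``` rather than relying on how ```data.frame()``` converts text, since that default differs between R versions. The variable name ```hand``` is made up for this illustration.

```{r}
# Build a factor from the dominant-hand values
hand <- factor(c("Right", "Right", "Left", "Right"))

levels(hand)   # the distinct categories, sorted alphabetically
nlevels(hand)  # how many categories there are
table(hand)    # how many observations fall in each category
```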

---

## Adding Columns to a Data Frame

You can add a new column to a data frame. Suppose that you want to add ```shoe size``` to the information you have compiled in the ```friends``` data frame. You could do this with this command:

```{r}
friends$Shoe_Size <- c(10,8,14,9)
```

This code states that you are creating a vector full of the values 10, 8, 14, and 9 using the ```c()``` function and that it will be appended into the ```friends``` data frame under the ```Shoe_Size``` column.

After executing this command, view the new data frame:

![The data frame in R Studio showing data points and columns. The title on the tab reads friends. Column headings are name, age, dominant hand, shoe size. Row one, Bob, thirty six, right, ten. Row two, Nancy, thirty one, right, eight. Row three, Cyrus, twenty six, left, fourteen. Row four, Jackie, thirty four, right, nine.](Media/dataFrames2.png)

The ```Shoe_Size``` column has been added to the right of the data frame and populated with values from the vector.

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [4]:
try:
    from DS_Students import MultipleChoice
    from ipynb.fs.full.DS102Questions import *
except:
    !pip install DS_Students
    from DS_Students import MultipleChoice
    from ipynb.fs.full.DS102Questions import *

In [5]:
try:
    display(L6P2Q1, L6P2Q2, L6P2Q3)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Built-in Data Frames<a class="anchor" id="DS102L6_page_3"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [6]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327990125', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L06-pg3tutorial.zip)**.

# Built-in Data Frames

Creating a data frame as you did at the beginning of this lesson is very cumbersome for all but the simplest of data. Generally, you will not create data frames this way. Instead, you will import them into R from an MS Excel or ```.csv``` file. But you may also work with built-in R data frames, especially for practice.

Several of the built-in datasets in R are data frames. This includes the ```mtcars``` dataset. This is a dataset taken from the Motor Trend magazine in 1974. You can see this data set by typing ```mtcars``` into the console. R will print the dataset for you. You can print just the first six rows of the data frame using the ```head()``` function.

```{r}
head(mtcars)
```

```text
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

You can also print the last six rows of the data with the ```tail()``` function.

```{r}
tail(mtcars)
```
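
Both ```head()``` and ```tail()``` accept an optional second argument giving the number of rows to show, and a couple of other base R functions summarize a data frame without printing all of it. A brief sketch:

```{r}
head(mtcars, 3)      # first three rows instead of the default six
tail(mtcars, 2)      # last two rows
dim(mtcars)          # number of rows, then number of columns
summary(mtcars$mpg)  # minimum, quartiles, mean, and maximum of one column
```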

Each row of this data set represents an observation; each observation is labeled by the name of the car from which the data comes. You can get more information about what each of the columns in the data set represents with the ```help()``` function: 

```{r}
help(mtcars)
```

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Want to see all the built-in datasets in R?</h3>
    </div>
    <div class="panel-body">
        <p>They are available with the R Datasets Package and their names are shown <a href="https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html">here.</a></p>
    </div>
</div>

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [7]:
try:
    display(L6P3Q1, L6P3Q2)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Importing CSV Data<a class="anchor" id="DS102L6_page_4"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [8]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('247057946', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L06-pg4tutorial.zip)**.

# Importing CSV Data

Building a data frame by creating its component vectors as you've done previously is fine when the data set is small. But for data sets of any appreciable size, it is very difficult. Fortunately, R has facilities to import data sets in several different formats. This page covers how to import a ```.csv``` file.

Generally, the data is collected and cleaned up as necessary using a tool other than R: a spreadsheet, for example. Once this is done, the data is saved in a file and imported into R.

---

## Importing a CSV File

CSV stands for comma-separated values. A ```.csv``` file is a text file in which each row holds one line of data values, separated by commas.

---

### Importing Using read.csv()

The R command to import a ```.csv``` file is ```read.csv()```. A file ```Pets.csv``` is available at **[this link.](https://repo.exeterlms.com/documents/V2/DataScience/Stat-Prog-R/PetsCSV.zip)**

You can import this file into a data frame in the variable ```my_pets``` using the ```read.csv``` function:

```{r}
my_pets <- read.csv("PetsCSV.csv")
```

If you type this command into R, you will get an error message if the file named in the command (here ```PetsCSV.csv```) is not in your current working directory, so make sure the name matches the file you downloaded. (A directory is another name for a folder.) R will try to import data files and run R programs from your working directory.

When R starts up, it sets your home directory as your working directory. You will probably need to change this to the directory where you are doing your work. You can do this in RStudio by going to the ```Session``` menu, then the ```Set Working Directory``` submenu, and the ```Choose Directory``` command on this menu. Choose the directory of the file you wish to import.

![R Studio dropdown menu. Session is selected, which presents a dropdown menu in which set working directory is selected, which presents another dropdown menu in which choose directory is selected.](Media/dataFrames3.png)
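
If you prefer typing to menus, base R offers ```getwd()``` and ```setwd()``` for checking and changing the working directory, and ```read.csv()``` also accepts a full path. The sketch below does not use the lesson's file; it writes a small made-up CSV to a temporary location so that it runs anywhere. The name ```pets_demo``` and its two rows are assumptions for illustration only.

```{r}
# Show the directory R currently reads from and writes to
getwd()
# setwd("path/to/your/folder")  # hypothetical path; edit and uncomment to change it

# Write a tiny made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Animal,Age", "Bruno,Goat,4", "Sophie,Cat,12"), csv_path)

# Confirm the file exists before trying to import it
file.exists(csv_path)

# read.csv() accepts a full path, so the file need not be in the working directory
pets_demo <- read.csv(csv_path)
pets_demo
```

Checking ```file.exists()``` first is a quick way to tell a wrong path apart from a problem inside the file itself.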

---

### Importing Using the Environment Pane

In RStudio, you can also import data at the click of a few buttons using a wizard! This is a nice feature because it allows you to preview your data and set headers.  In the Environment pane, you will see a button for ```Import Dataset```.  Clicking on the down arrow yields this menu: 

![The environment panel in R Studio. The button import dataset has been selected and has produced a dropdown menu with the options from text open parentheses base close parentheses, from text open parentheses reader closer parentheses, from excel, from S P S S, from S A S, and from Stata. From text open parentheses base close parentheses is selected.](Media/dataFrames4.png)

You want the ```From Text (base)``` option that is highlighted.

Once you click on it, you can navigate to where your file is stored and select the file.  Then, you will see this window: 

![The import dataset window that appears once from text open parentheses base close parentheses is selected from the environment pane in R Studio and once you navigate to the file you want to open. The window shows the name of the file you have chosen and options for encoding, heading, row names, separator, decimal, quote, comment, and N A dot strings. The window allows the user to choose strings as factors or not. In the upper right of the window is the content of the input file. In the bottom right of the window is the data frame.](Media/dataFrames5.png)

The original file is shown on the top right-hand side, and a preview of the data frame R will create is on the bottom right. Make sure that the ```Yes``` radio button is selected for ```Heading``` if your file has a header row. The other important feature to note here is the last checkbox at the bottom, ```Strings as factors```. Most of the time, it is a good idea to leave this checked, as it will turn your character data into categorical variables that can be used for most analyses. However, a few commands work only with strings, so if you run into errors, you may need to re-import with this box unchecked.

Once you've changed all the options you need, hit the ```Import``` button and away you go!

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [9]:
try:
    display(L6P4Q1, L6P4Q2, L6P4Q3)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Importing MS Excel Files<a class="anchor" id="DS102L6_page_5"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [10]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('328060080', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L06-pg5tutorial.zip)**.

# Importing an MS Excel File

MS Excel is very common in businesses, and you may receive data in the ```.xlsx``` format. Although you could convert it into a ```.csv``` file by using the ```Save as``` button in MS Excel, R also supports importing ```.xlsx``` files directly.

As with ```.csv``` files, there are two ways to import MS Excel data.  

---

### Importing Using read_excel()

The R command to import a ```.xlsx``` file is ```read_excel()```, which is part of the ```readxl``` package. The ```Pets``` file is **[now available as an MS Excel document](https://repo.exeterlms.com/documents/V2/DataScience/Stat-Prog-R/Pets.zip)** for you to try.

You can import this file into a data frame in the variable ```my_petsExcel``` using the ```read_excel``` function:

```{r}
library(readxl)
my_petsExcel <- read_excel("Pets.xlsx")
```

Just like with CSV files, you will get an error message if the ```Pets.xlsx``` file is not in your current working directory.
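
To experiment without downloading anything, the ```readxl``` package ships a few sample workbooks. The sketch below assumes ```readxl``` is installed and uses its bundled ```datasets.xlsx``` file rather than the lesson's pets file; ```demo_path``` and ```cars_sheet``` are names invented here.

```{r}
library(readxl)

# Path to a sample workbook that is installed with the readxl package
demo_path <- readxl_example("datasets.xlsx")

# List the worksheet names in the workbook
excel_sheets(demo_path)

# Read one worksheet by name; the first row supplies the column names by default
cars_sheet <- read_excel(demo_path, sheet = "mtcars")
head(cars_sheet)
```

When a workbook has several worksheets, the ```sheet``` argument lets you pick one by name or by position.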

---

### Importing Using the Environment Pane

There is also an option to import MS Excel files in the Environment pane.  Once you click the button for ```Import Dataset```, you will want to choose the ```From Excel``` option: 

![The environment panel in R Studio. The button import dataset has been selected and has produced a dropdown menu with the options from text open parentheses base close parentheses, from text open parentheses reader closer parentheses, from excel, from S P S S, from S A S, and from Stata. From excel is selected.](Media/dataFrames6.png)

Once you click on it, you will use the ```Browse``` button in the upper right-hand corner to navigate to where your file is stored and select the file. Then, you will see this window:

![The import excel data window that appears once from excel is selected from the environment pane in R Studio and once you navigate to the file you want to open. The window shows the path and name of the file, a data preview, and import options for name, sheet, range, max rows, and skips. It allows the user to choose first row as names or not, and open data viewer or not. In the bottom right is a code preview.](Media/dataFrames7.png)

The data preview shows the data type each variable will be imported as by default and allows you to change it. R also gives you the option to include or skip each variable, so you can easily choose what data goes into your data frame.

![The import excel data window that appears once from excel is selected from the environment pane in R Studio and once you navigate to the file you want to open. The window shows the path and name of the file, a data preview, and import options for name, sheet, range, max rows, and skips. It allows the user to choose first row as names or not, and open data viewer or not. In the bottom right is a code preview.](Media/dataFrames8.png)

There is a header option for MS Excel files as well. When the ```First Row as Names``` checkbox is checked, R uses the first row of the worksheet as the column names.

Once you've changed all the options you need, hit the ```Import``` button and away you go!

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [11]:
try:
    display(L6P5Q1)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Manipulating Data<a class="anchor" id="DS102L6_page_6"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [12]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327990281', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L06-pg6tutorial.zip)**.

# Manipulating Data

One of R's strengths is the ability to manipulate data to facilitate statistical analysis and to extract meaning and information from the data. Consider the following types of manipulation:

* **Filtering:** Selecting a subset of observations that meet specific criteria from a data frame.
* **Ordering:** Arranging the observations in a data frame into a particular order according to given criteria.
* **Selection:** Choosing certain variables from a data frame.
* **Mutation:** Computing new variables from variables that already exist in the data frame.
* **Grouping:** Creating groups of observations that have given common characteristics.
* **Summarizing:** Describing observations using sample statistics.

The base R language has sophisticated functions that will perform each of these manipulations. Unfortunately, like many things in the base R language, learning to use these functions effectively can be frustrating, difficult, and time consuming.

In this lesson, instead of using the base R language functions, you will be introduced to the ```dplyr``` package and use it to manipulate data. The ```dplyr``` package is part of the ```tidyverse``` collection of R packages; you were introduced to the ```tidyverse``` earlier. You can find more information about the ```tidyverse``` **[here](https://www.tidyverse.org/)**.

First, install ```dplyr``` with the command:

```{r}
install.packages("dplyr")
```

You will only need to install the ```dplyr``` package once; at the start of each R session, you will then need to load the ```dplyr``` library with the command:

```{r}
library("dplyr")
```

This makes the functions in ```dplyr``` available for your use.
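
As a preview, each manipulation in the list earlier on this page has a matching ```dplyr``` function. The sketch below applies them to the built-in ```mtcars``` data frame and assumes ```dplyr``` is already installed; the names ```by_cyl```, ```wt_lb```, and ```mean_mpg``` are made up for the example. The pages that follow cover these functions in more detail.

```{r}
library(dplyr)

filter(mtcars, cyl == 4)                 # filtering: keep only four-cylinder cars
arrange(mtcars, mpg)                     # ordering: sort rows by miles per gallon
select(mtcars, mpg, hp)                  # selection: keep two columns
mutate(mtcars, wt_lb = wt * 1000)        # mutation: weight in pounds (wt is in 1000 lbs)
by_cyl <- group_by(mtcars, cyl)          # grouping: one group per cylinder count
summarize(by_cyl, mean_mpg = mean(mpg))  # summarizing: average mpg within each group
```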

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch <a href="https://vimeo.com/418281958">this recorded live workshop </a> on the anatomy of dplyr functions, which is meant to go along with the remainder of this lesson.</p>
    </div>
</div>

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [13]:
try:
    display(L6P6Q1, L6P6Q2, L6P6Q3, L6P6Q4, L6P6Q5, L6P6Q6)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Filtering Data<a class="anchor" id="DS102L6_page_7"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [14]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327990180', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L06-pg7tutorial.zip)**.

# Filtering

You will use your imported ```my_pets``` data frame to practice data manipulations.

Suppose you wish to get all observations (rows) that contain goats. You can do this with the following ```filter()``` function:

```{r}
filter(my_pets, Animal == "Goat")
```

This returns the following rows:

```text
   Name Animal Age Weight Skin
1 Bruno   Goat   4     35 hair
2 Jacko   Goat   1     45 hair
```

In this example, ```filter()``` has two arguments. The first is the data frame from which you will get the rows; in this case, it is ```my_pets```. The second is a condition that is used to select rows.  The condition is that the ```Animal``` variable must be equal to ```Goat```.

You can add more conditions as arguments to ```filter()``` if you want to select on multiple variables. For example, to get all the rows containing goats whose age is two or less, you could use the following:

```{r}
filter(my_pets, Animal == "Goat", Age <= 2)
```

This yields only one goat:

```text
   Name Animal Age Weight Skin
1 Jacko   Goat   1     45 hair
```

You can use ```filter()``` to get all the observations of animals heavier than three pounds:

```{r}
filter(my_pets, Weight > 3)
```

```text
    Name Animal Age Weight Skin
1  Bruno   Goat   4   35.0 hair
2  Jacko   Goat   1   45.0 hair
3 Sophie    Cat  12    3.5  fur
```
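
Separating conditions with commas keeps only rows that satisfy all of them. When either condition is enough, the logical OR operator ```|``` can be used inside ```filter()```. The sketch below rebuilds a stand-in for ```my_pets``` from the values printed in this lesson; the order of the rows is an assumption, and ```dplyr``` is assumed to be installed. The names ```young_goats``` and ```cat_or_heavy``` are made up here.

```{r}
library(dplyr)

# Stand-in for my_pets, reconstructed from the outputs shown in this lesson
my_pets <- data.frame(
  Name   = c("Bruno", "Jacko", "Sophie", "Patches", "Boozer", "Boozer"),
  Animal = c("Goat", "Goat", "Cat", "Guinea Pig", "Gold Fish", "Gold Fish"),
  Age    = c(4, 1, 12, 2, 2, 3),
  Weight = c(35, 45, 3.5, 2, 0.03, 0.04),
  Skin   = c("hair", "hair", "fur", "fur", "scales", "scales")
)

# Joining conditions with & has the same effect as separating them with commas
young_goats <- filter(my_pets, Animal == "Goat" & Age <= 2)

# With |, a row is kept when at least one condition is true
cat_or_heavy <- filter(my_pets, Animal == "Cat" | Weight > 40)

young_goats
cat_or_heavy
```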

---

### %in%

Suppose you want all the observations of mammals in the data frame. You could do this as follows:

```{r}
filter(my_pets, Animal %in% c("Goat", "Cat", "Guinea Pig"))
```

The following rows meet those criteria:

```text
     Name     Animal Age Weight Skin
1   Bruno       Goat   4   35.0 hair
2   Jacko       Goat   1   45.0 hair
3  Sophie        Cat  12    3.5  fur
4 Patches Guinea Pig   2    2.0  fur
```

This works as follows. ```c("Goat", "Cat", "Guinea Pig")``` creates a vector of the three animal types, and the ```%in%``` operator checks, for each row, whether the ```Animal``` value matches any element of that vector. Rows with a match are kept.
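
To see what ```%in%``` does on its own, outside of ```filter()```, you can apply it to a plain vector. In this small sketch the names ```animals``` and ```is_match``` are made up; the result is one ```TRUE``` or ```FALSE``` per element, which is what ```filter()``` uses to decide which rows to keep.

```{r}
animals <- c("Goat", "Gold Fish", "Cat", "Goat")

# TRUE where the element matches any value in the right-hand vector
is_match <- animals %in% c("Goat", "Cat")
is_match

# The logical vector can be used to pick out the matching elements
animals[is_match]
```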

In this particular data set, you could also get all of the mammals by getting everything that is not a gold fish.  Put those logical operators to work!

```{r}
filter(my_pets, Animal != "Gold Fish")
```

```text
     Name     Animal Age Weight Skin
1   Bruno       Goat   4   35.0 hair
2   Jacko       Goat   1   45.0 hair
3  Sophie        Cat  12    3.5  fur
4 Patches Guinea Pig   2    2.0  fur
```

---

### Filtering into a Data Frame

Often you will want to save the data frame that is returned by ```filter()``` for further analysis. You can do this by assigning the returned value to a variable. For example:

```{r}
mammals <- filter(my_pets, Animal != "Gold Fish")
mammals
```

```text
     Name     Animal Age Weight Skin
1   Bruno       Goat   4   35.0 hair
2   Jacko       Goat   1   45.0 hair
3  Sophie        Cat  12    3.5  fur
4 Patches Guinea Pig   2    2.0  fur
```

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [15]:
try:
    display(L6P7Q1, L6P7Q2, L6P7Q3, L6P7Q4)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Ordering Data<a class="anchor" id="DS102L6_page_8"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [16]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327990291', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L06-pg8tutorial.zip)**.

# Ordering Data

You can use the ```arrange()``` function to sort the observations of a data frame. For example, if you wanted to sort the rows from youngest to oldest (ascending order), you could do this as follows:

```{r}
arrange(my_pets, Age)
```

This yields the following:

```text
     Name     Animal Age Weight   Skin
1   Jacko       Goat   1  45.00   hair
2 Patches Guinea Pig   2   2.00    fur
3  Boozer  Gold Fish   2   0.03 scales
4  Boozer  Gold Fish   3   0.04 scales
5   Bruno       Goat   4  35.00   hair
6  Sophie        Cat  12   3.50    fur
```

The first argument to ```arrange()``` is the data frame; the second argument is the variable upon which to sort. You can add other variables if you want to sort on them as well.
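
As a sketch of sorting on more than one variable, the example below uses the built-in ```mtcars``` data frame and assumes ```dplyr``` is installed. Rows are ordered by the first variable, and the second variable only breaks ties within it. The name ```sorted_cars``` is made up for this illustration.

```{r}
library(dplyr)

# Sort by number of cylinders first, then by miles per gallon within each cylinder count
sorted_cars <- arrange(mtcars, cyl, mpg)

# Show just the two sorting columns for the first five rows
head(sorted_cars[, c("cyl", "mpg")], 5)
```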

---

## Sorting in Descending Order

You could sort from oldest to youngest as follows using the ```desc()``` function (which is short for "descending"):

```{r}
arrange(my_pets, desc(Age))
```

And now the data is shown in reverse order:

```text
     Name     Animal Age Weight   Skin
1  Sophie        Cat  12   3.50    fur
2   Bruno       Goat   4  35.00   hair
3  Boozer  Gold Fish   3   0.04 scales
4 Patches Guinea Pig   2   2.00    fur
5  Boozer  Gold Fish   2   0.03 scales
6   Jacko       Goat   1  45.00   hair
```

---

## The Pipe Operator

Suppose you want to select only the mammals, and then sort them from youngest to oldest. You could do the following:

```{r}
mammals <- filter(my_pets, Animal != "Gold Fish")
arrange(mammals, Age)
```

You now have all the mammals sorted in ascending order by age:

```text
     Name     Animal Age Weight Skin
1   Jacko       Goat   1   45.0 hair
2 Patches Guinea Pig   2    2.0  fur
3   Bruno       Goat   4   35.0 hair
4  Sophie        Cat  12    3.5  fur
```

This requires you to create a variable to store the filtered results. You can achieve the same results by using the *pipe operator*, which is ```%>%```. The pipe operator takes the result on its left and passes it as the first argument to the function on its right, so the code below reads: "take the data frame ```my_pets``` and then ```filter``` it and then ```arrange``` it." Each "and then" in that sentence is the pipe operator at work! The operator comes from the ```magrittr``` package and is available once ```dplyr``` is loaded.

```{r}
my_pets %>% filter(Animal != "Gold Fish") %>% arrange(Age)
```

When you are using the pipe operator to connect data frames and functions, you do not include the data frame as the first argument of the function; it instead goes first. Thus, the only argument to ```filter()``` is the condition that selects rows; the only argument to ```arrange()``` is the variable on which you want to sort. The pipe operator can make data analysis easier to understand and read; it feels more like a logical sentence.

Whether you run the two commands separately or chain them with the pipe operator, the results are the same:

```text
     Name     Animal Age Weight Skin
1   Jacko       Goat   1   45.0 hair
2 Patches Guinea Pig   2    2.0  fur
3   Bruno       Goat   4   35.0 hair
4  Sophie        Cat  12    3.5  fur
```
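
Longer chains are often written with one step per line, ending each line with the pipe so that R knows the command continues. The sketch below uses the built-in ```mtcars``` data frame and assumes ```dplyr``` is installed; ```efficient_cars``` is a name made up here. R 4.1.0 and later also provide a built-in pipe, ```|>```, which behaves the same way in simple chains like this one.

```{r}
library(dplyr)

# Take mtcars, and then keep four-cylinder cars, and then sort from highest to lowest mpg
efficient_cars <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg))

head(efficient_cars, 3)
```

Putting each step on its own line also makes it easy to comment out one step while you check the others.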

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Don't confuse the pipe operator %>% with the pipe symbol | . In R, | is the logical OR operator; it does not pass data from one function to the next the way it does in a command-line shell.</p>
    </div>
</div>

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [17]:
try:
    display(L6P8Q1, L6P8Q2, L6P8Q3)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Selecting and Mutating Data<a class="anchor" id="DS102L6_page_9"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [18]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327990315', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L06-pg9tutorial.zip)**.

# Selection

In many cases, you may end up working with data sets that have hundreds of variables, of which only some will be of interest at any given time. You can keep just the variables you need with the ```select()``` function. As an example, suppose you need only the names and types of the pets; you can select these columns as follows:

```{r}
select(my_pets, Name, Animal)
```

This returns a data frame with just those two columns, as expected:

```text
     Name     Animal
1   Bruno       Goat
2   Jacko       Goat
3  Sophie        Cat
4 Patches Guinea Pig
5  Boozer  Gold Fish
6  Boozer  Gold Fish
```
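```select()``` also works in a pipeline, and a minus sign in front of a column name drops that column instead of keeping it. The sketch below is illustrative only: it assumes ```dplyr``` is installed, rebuilds ```my_pets``` by hand from the output printed in this lesson, and uses made-up variable names.

```{r}
library(dplyr)

# Assumption: my_pets rebuilt by hand from the output shown in this lesson.
my_pets <- data.frame(
  Name   = c("Bruno", "Jacko", "Sophie", "Patches", "Boozer", "Boozer"),
  Animal = c("Goat", "Goat", "Cat", "Guinea Pig", "Gold Fish", "Gold Fish"),
  Age    = c(4, 1, 12, 2, 3, 2),
  Weight = c(35, 45, 3.5, 2, 0.04, 0.03),
  Skin   = c("hair", "hair", "fur", "fur", "scales", "scales")
)

# Keep two columns, written as a pipeline
names_and_types <- my_pets %>% select(Name, Animal)

# Drop one column and keep everything else
no_skin <- my_pets %>% select(-Skin)

names(names_and_types)
names(no_skin)
```

Dropping columns is often shorter than listing the ones to keep when a data set has many variables.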

---

# Mutation

Mutation is the process of computing new columns in the data frame from existing columns. As an example, suppose you want the weights of the pets in kilograms. You can compute kilograms from pounds by dividing a weight in pounds by 2.20462. To create a new column with weights in kilograms, you use the ```mutate()``` function:

```{r}
mutate(my_pets, Weight_kg = Weight/2.20462)
```

The first argument is the data frame name ```my_pets```, the second argument is ```Weight_kg = Weight/2.20462```.  The variable on the left of the equals sign becomes the new column name, and the formula on the right of the equals sign shows how to compute this new column from the existing columns. The formula can include multiple columns if need be.

```mutate()``` adds the new column to the end of the data frame:

```text
     Name     Animal Age Weight   Skin   Weight_kg
1   Bruno       Goat   4  35.00   hair 15.87575183
2   Jacko       Goat   1  45.00   hair 20.41168092
3  Sophie        Cat  12   3.50    fur  1.58757518
4 Patches Guinea Pig   2   2.00    fur  0.90718582
5  Boozer  Gold Fish   3   0.04 scales  0.01814372
6  Boozer  Gold Fish   2   0.03 scales  0.01360779
```
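Two details are worth trying out: ```mutate()``` returns a new data frame and leaves the original untouched, and a formula can combine more than one existing column. The sketch below assumes ```dplyr``` is installed and rebuilds ```my_pets``` by hand from the output printed in this lesson; ```pets_kg``` and ```Weight_per_year``` are made-up names used only for illustration.

```{r}
library(dplyr)

# Assumption: my_pets rebuilt by hand from the output shown in this lesson.
my_pets <- data.frame(
  Name   = c("Bruno", "Jacko", "Sophie", "Patches", "Boozer", "Boozer"),
  Animal = c("Goat", "Goat", "Cat", "Guinea Pig", "Gold Fish", "Gold Fish"),
  Age    = c(4, 1, 12, 2, 3, 2),
  Weight = c(35, 45, 3.5, 2, 0.04, 0.03),
  Skin   = c("hair", "hair", "fur", "fur", "scales", "scales")
)

# Store the result in a new variable to keep the added columns
pets_kg <- my_pets %>%
  mutate(
    Weight_kg       = Weight / 2.20462,  # pounds to kilograms
    Weight_per_year = Weight / Age       # formula using two existing columns
  )

names(pets_kg)   # includes the two new columns
names(my_pets)   # the original data frame is unchanged
```

To keep the new columns for later steps, assign the result to a variable, as done with ```pets_kg``` here.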

---

## Adding a Column with the Same Value

Sometimes it can be helpful to add a column that has the same value for every row. So, for example, you could add a ```Status``` column that indicates that all of the pets are alive:

```{r}
mutate(my_pets, Status = "Alive")
```

Again, the new column is added to the end of the data frame, and you will see that every row of the new column is filled with ```Alive``` because no logical operators were used to conditionally fill the column.

```text
     Name     Animal Age Weight   Skin Status
1   Bruno       Goat   4  35.00   hair  Alive
2   Jacko       Goat   1  45.00   hair  Alive
3  Sophie        Cat  12   3.50    fur  Alive
4 Patches Guinea Pig   2   2.00    fur  Alive
5  Boozer  Gold Fish   3   0.04 scales  Alive
6  Boozer  Gold Fish   2   0.03 scales  Alive
```
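For contrast, here is a sketch of a column that is filled conditionally. It uses base R's ```ifelse()``` inside ```mutate()```. The ```Size``` column and the 10-pound cutoff are invented for this example, ```dplyr``` is assumed to be installed, and ```my_pets``` is rebuilt by hand from the output printed in this lesson.

```{r}
library(dplyr)

# Assumption: my_pets rebuilt by hand from the output shown in this lesson.
my_pets <- data.frame(
  Name   = c("Bruno", "Jacko", "Sophie", "Patches", "Boozer", "Boozer"),
  Animal = c("Goat", "Goat", "Cat", "Guinea Pig", "Gold Fish", "Gold Fish"),
  Age    = c(4, 1, 12, 2, 3, 2),
  Weight = c(35, 45, 3.5, 2, 0.04, 0.03),
  Skin   = c("hair", "hair", "fur", "fur", "scales", "scales")
)

# ifelse(test, value_if_true, value_if_false) is evaluated row by row
pets_sized <- my_pets %>%
  mutate(Size = ifelse(Weight > 10, "Large", "Small"))

pets_sized %>% select(Name, Weight, Size)
```

Only the two goats weigh more than 10 pounds, so they are the only rows labeled ```Large```.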

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [19]:
try:
    display(L6P9Q1, L6P9Q2)
except:
    pass

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Grouping and Summarizing Data<a class="anchor" id="DS102L6_page_10"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [20]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('328466905', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L06-pg10tutorial.zip)**.

# Grouping

Grouping is organizing the data into groups of interest that share a common characteristic. Grouping is not useful by itself: the ```group_by()``` function does not change the values in your data; it only records which rows belong together. It becomes powerful when combined with summary statistics or plots.
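The sketch below shows what grouping on its own does and does not do. It assumes ```dplyr``` is installed and rebuilds ```my_pets``` by hand from the output printed in this lesson; ```by_animal``` is a made-up variable name. ```group_vars()``` and ```n_groups()``` are ```dplyr``` helpers that report the grouping information.

```{r}
library(dplyr)

# Assumption: my_pets rebuilt by hand from the output shown in this lesson.
my_pets <- data.frame(
  Name   = c("Bruno", "Jacko", "Sophie", "Patches", "Boozer", "Boozer"),
  Animal = c("Goat", "Goat", "Cat", "Guinea Pig", "Gold Fish", "Gold Fish"),
  Age    = c(4, 1, 12, 2, 3, 2),
  Weight = c(35, 45, 3.5, 2, 0.04, 0.03),
  Skin   = c("hair", "hair", "fur", "fur", "scales", "scales")
)

by_animal <- my_pets %>% group_by(Animal)

# Same number of rows and same values as before grouping
nrow(by_animal) == nrow(my_pets)
all(by_animal$Age == my_pets$Age)

# Only the grouping information has been added
group_vars(by_animal)
n_groups(by_animal)
```

The rows and values are unchanged; what is new is that later functions know to work on each ```Animal``` group separately.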

---

# Summarizing

You can get summaries of your data using the ```summarize()``` function. For example, to get the mean age, you can use the command:

```{r}
summarize(my_pets, ave_age = mean(Age))
```

And this is the output you will receive:

```text
  ave_age
1       4
```

The first argument is the data frame; the second argument is ```ave_age = mean(Age)```. As with the ```mutate()``` function, the term on the left side of the equals sign gives the name of the column to be created; you are creating something called ```ave_age```. The term on the right of the equals sign is the function to be applied.  In this case, you are applying the mean.

Note that the output of ```summarize()``` is only one row, because it computed the mean of all the rows in the data frame.  It is only one column, too: the value computed by the summary.

---

# Combining Group By and Summarize

Suppose you want to know the average age of each type of animal. To find this, you would combine ```summarize()``` and ```group_by()``` as follows; you will use a pipe ```(%>%)``` to combine them:

```{r}
my_pets %>% group_by(Animal) %>% summarize(ave.age = mean(Age))
```

```text
# A tibble: 4 x 2
  Animal     ave.age
  <fctr>       <dbl>
1 Cat           12.0
2 Goat           2.5
3 Gold Fish      2.5
4 Guinea Pig     2.0
```

The first thing you will notice with this output is that you have created a *tibble*.  What is a tibble, you ask? Well, it is essentially a data frame with some additional information.  You can think of it as a modern alternative to a data frame; the ```tidyverse``` packages all work with tibbles. 

If you print the result of ```group_by(Animal)``` on its own, before summarizing, the output also includes a note telling you what the tibble has been grouped by, ```Animal```, and that there are four groups.

You can see that the result of this summary is a tibble that has four rows and two columns, ```Animal``` and ```ave.age```. Each row shows one type of animal and the mean ```Age``` value for that animal type.

---

## Functions for summarize()

There are many functions that can be used with ```summarize()```. Some of these include:

* **mean() :** Computes the mean of the values.

* **median() :** Computes the median of the values.

* **sd() :** Computes the standard deviation of the values.

* **min() and max() :** Compute the minimum and maximum values.

* **n() :** Counts the number of rows in each group.
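Several of these functions can be used in a single ```summarize()``` call, each producing its own column. The sketch below is illustrative: it assumes ```dplyr``` is installed, rebuilds ```my_pets``` by hand from the output printed in this lesson, and uses made-up column names.

```{r}
library(dplyr)

# Assumption: my_pets rebuilt by hand from the output shown in this lesson.
my_pets <- data.frame(
  Name   = c("Bruno", "Jacko", "Sophie", "Patches", "Boozer", "Boozer"),
  Animal = c("Goat", "Goat", "Cat", "Guinea Pig", "Gold Fish", "Gold Fish"),
  Age    = c(4, 1, 12, 2, 3, 2),
  Weight = c(35, 45, 3.5, 2, 0.04, 0.03),
  Skin   = c("hair", "hair", "fur", "fur", "scales", "scales")
)

age_stats <- my_pets %>%
  group_by(Animal) %>%
  summarize(
    count      = n(),          # rows in each group
    mean_age   = mean(Age),
    median_age = median(Age),
    sd_age     = sd(Age),      # NA when a group has only one row
    min_age    = min(Age),
    max_age    = max(Age)
  )

age_stats
```

Note that ```sd()``` needs at least two values, so groups with a single pet show ```NA``` in the ```sd_age``` column.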

---

### n()

```n()``` is very useful in determining the number of rows in a data frame that meet a given condition. For example, suppose you want to determine how many animals of each type are in the data frame. You can do it as follows:

```{r}
my_pets %>% group_by(Animal) %>% summarize(count = n())
```

That will provide these results: 

```text
# A tibble: 4 x 2
  Animal     count
  <fctr>     <int>
1 Cat            1
2 Goat           2
3 Gold Fish      2
4 Guinea Pig     1
```

---

## Grouping by Multiple Variables

Finally, you can group by more than one variable. So, for example, you can group by ```Name``` and ```Animal```, and count the number of pets that share each distinct pair of values:

```{r}
my_pets %>% group_by(Name, Animal) %>% summarize(count = n())
```

That yields this tibble:

```text
# A tibble: 5 x 3
# Groups:   Name [?]
  Name    Animal     count
  <fctr>  <fctr>     <int>
1 Boozer  Gold Fish      2
2 Bruno   Goat           1
3 Jacko   Goat           1
4 Patches Guinea Pig     1
5 Sophie  Cat            1
```
```
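When you summarize after grouping by two variables, the result is still grouped by the first one, which is what the ```# Groups:``` line reports. The sketch below shows two optional variations, assuming ```dplyr``` version 1.0.0 or later: the ```.groups = "drop"``` argument returns an ungrouped result, and ```count()``` is a shorthand that groups and counts in one step (its count column is named ```n```). ```my_pets``` is rebuilt by hand from the output printed in this lesson, and the variable names are made up.

```{r}
library(dplyr)

# Assumption: my_pets rebuilt by hand from the output shown in this lesson.
my_pets <- data.frame(
  Name   = c("Bruno", "Jacko", "Sophie", "Patches", "Boozer", "Boozer"),
  Animal = c("Goat", "Goat", "Cat", "Guinea Pig", "Gold Fish", "Gold Fish"),
  Age    = c(4, 1, 12, 2, 3, 2),
  Weight = c(35, 45, 3.5, 2, 0.04, 0.03),
  Skin   = c("hair", "hair", "fur", "fur", "scales", "scales")
)

# Group by two variables, count, and drop the leftover grouping
pair_counts <- my_pets %>%
  group_by(Name, Animal) %>%
  summarize(count = n(), .groups = "drop")

# Shorthand: count() groups, counts, and ungroups in one call
pair_counts_short <- my_pets %>% count(Name, Animal)

pair_counts
pair_counts_short
```

Both versions give one row per distinct ```Name``` and ```Animal``` pair, with the two goldfish named Boozer counted together.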

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [21]:
try:
    display(L6P10Q1, L6P10Q2, L6P10Q3)
except:
    pass

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Graphing Data Grouped by Factors<a class="anchor" id="DS102L6_page_11"></a>

[Back to Top](#DS102L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [22]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327990198', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L06-pg11tutorial.zip)**.

# Graphing Data Grouped by Factors

A structure like a data frame allows you to group data by factors to determine what effect the factors have on the data. To illustrate this, you will use the ```morley``` data set that is built into R. The data in this data set are the values of the speed of light measured by Michelson in 1879.

The data were collected in five experiments; in each experiment, the speed of light was measured 20 times. Each measurement is called a ```Run```. The measured value was the speed of light in km/sec; in the morley data set, 299000 is subtracted from the measured value.

You can see the first several rows in this data set with the command ```head(morley)```. This gives:

```
    Expt Run Speed
001    1   1   850
002    1   2   740
003    1   3   900
004    1   4  1070
005    1   5   930
006    1   6   850
```
You can also view it in your Data Pane. ```Expt``` is the number of the experiment, ```Run``` is the number of the specific measurement, and ```Speed``` is the measured value.

You can create a box plot of this data, grouped by ```Expt```, using the following command:
```
ggplot(morley, aes(x = Expt, y = Speed)) + geom_boxplot(aes(group=Expt))
```
```x``` is the variable you are grouping by, and ```y``` is the continuous variable to look at. You then specify the grouping variable again in ```aes(group = )```. This command gives the following box plot:

![A boxplot for morley dataframe, grouped by Expt. The x-axis on the bottom reads Expt. The y-axis on the left reads Speed. Experiment 1 and 3 show outliers.](Media/L06-SpeedBoxPlot.png)

The argument ```aes(group=Expt)``` tells R to group the data by experiment. All the speed values from Experiment 1 (in other words, each of the speed values from a row in which Expt is equal to 1) are used to create the leftmost box plot. Similarly, all the speed values from Experiment 2 are used to create the next box plot, and so on.
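The ```group``` argument is needed because ```Expt``` is stored as a whole number, which ```ggplot2``` treats as a continuous variable. A common alternative, sketched below, is to convert ```Expt``` to a factor inside ```aes()``` so that each experiment gets its own box without a separate ```group``` argument. This is only one way to write it; it assumes ```ggplot2``` is installed, and the variable name ```p``` and the axis label are choices made for this example.

```{r}
library(ggplot2)

# factor(Expt) makes the experiment number categorical, one box per level
p <- ggplot(morley, aes(x = factor(Expt), y = Speed)) +
  geom_boxplot() +
  labs(x = "Expt")  # keep the original axis label instead of "factor(Expt)"

p
```

The boxes are the same as in the grouped version; the x-axis is now treated as five categories instead of a number line.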

You can find the mean speed for each experiment using the ```group_by()``` and ```summarize()``` functions as follows:
```
morley %>% group_by(Expt) %>% summarize(m.speed = mean(Speed))
```
This yields the following tibble:
```
# A tibble: 5 x 2
   Expt m.speed
  <int>   <dbl>
1     1   909.0
2     2   856.0
3     3   845.0
4     4   820.5
5     5   831.5
```
From this, you can see that the mean speed from Experiment 1 is 909.0; the mean speed from Experiment 2 is 856.0; etc.
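Because 299000 was subtracted from every measurement, you can add it back to express the means in km/sec. The sketch below also adds the standard deviation for each experiment. It assumes ```dplyr``` is installed; the names ```speed_stats```, ```sd.speed```, and ```m.speed.kms``` are made up for this example.

```{r}
library(dplyr)

speed_stats <- morley %>%
  group_by(Expt) %>%
  summarize(
    m.speed     = mean(Speed),           # mean of the stored values
    sd.speed    = sd(Speed),             # spread within each experiment
    m.speed.kms = mean(Speed) + 299000   # mean in km/sec
  )

speed_stats
```

For Experiment 1, for example, the stored mean of 909.0 corresponds to 299909 km/sec.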

---

# Summary

From here on out, most of the data you will work with in R will be stored in data frames or tibbles, as those structures are easily fed into statistical analyses. In data frames, the columns are the variables and the rows are the observations. However, to analyze exactly the data you want, you may have to manipulate it first, which is where the ```dplyr``` package comes in. Using ```dplyr```, you can filter down to only the observations or variables you're interested in, easily group and summarize data, and even create new variables using ```mutate()```.

---

# Review
Below is a quiz to review the recently covered material. Quizzes are *not* graded.



In [23]:
try:
    display(L6P11Q1)
except:
    pass
