Remove factors from 05-data-structures-part2.Rmd #829

Merged
merged 3 commits into from
May 16, 2023
115 changes: 24 additions & 91 deletions episodes/05-data-structures-part2.Rmd
@@ -8,10 +8,7 @@ source: Rmd
::::::::::::::::::::::::::::::::::::::: objectives

- Add and remove rows or columns.
- Remove rows with `NA` values.
- Append two data frames.
- Understand what a `factor` is.
- Convert a `factor` to a `character` vector and vice versa.
- Display basic properties of data frames including size and class of the columns, names, and first few rows.

::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -37,7 +34,7 @@ data are consistent in type throughout the columns. As such, if we want to add a
new column, we can start by making a new vector:

```{r, echo=FALSE}
cats <- read.csv("data/feline-data.csv", stringsAsFactors = TRUE)
cats <- read.csv("data/feline-data.csv")
```

```{r}
@@ -84,96 +81,32 @@ newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
```

Looks like our attempt to use the `rbind()` function returns a warning. Recall that, unlike errors, warnings do not necessarily stop a function from performing its intended action. You can confirm this by taking a look at the `cats` data frame.
Let's confirm that our new row was added correctly.

```{r}
cats
```

Notice that although we successfully added a new row, there is an `NA` in the *coat* column where we expected "tortoiseshell" to be. Why did this happen?

## Factors

For an object containing the data type `factor`, each different value represents what is called a `level`. In our case, the `factor` "coat" has 3 levels: "black", "calico", and "tabby". R will only accept values that match one of the levels. If you add a new value, it will become `NA`.

The warning is telling us that we unsuccessfully added "tortoiseshell" to our
*coat* factor, but 3.3 (a numeric), TRUE (a logical), and 9 (a numeric) were
successfully added to *weight*, *likes\_string*, and *age*, respectively, since
those variables are not factors. To successfully add a cat with a
"tortoiseshell" *coat*, add "tortoiseshell" as a possible *level* of the factor:

```{r}
levels(cats$coat)
levels(cats$coat) <- c(levels(cats$coat), "tortoiseshell")
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))
```
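
The level-matching behaviour of factors can be reproduced on a standalone vector. This is only a minimal sketch: `coat` is a hypothetical stand-in for `cats$coat` and is not read from the lesson's data file.

```r
# Hypothetical stand-in for cats$coat (not read from feline-data.csv)
coat <- factor(c("calico", "black", "tabby"))
levels(coat)

# Assigning a value that is not an existing level stores NA (R also warns)
suppressWarnings(coat[4] <- "tortoiseshell")
is.na(coat[4])

# Registering the new level first lets the same assignment succeed
levels(coat) <- c(levels(coat), "tortoiseshell")
coat[4] <- "tortoiseshell"
as.character(coat[4])
```

The warning is suppressed here only to keep the output short; in interactive use it is the signal that a value was silently replaced by `NA`.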

Alternatively, we can change a factor into a character vector; we lose the
handy categories of the factor, but we can subsequently add any word we want to the
column without babysitting the factor levels:

```{r}
str(cats)
cats$coat <- as.character(cats$coat)
str(cats)
```

::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 1

Let's imagine that 1 cat year is equivalent to 7 human years.

1. Create a vector called `human_age` by multiplying `cats$age` by 7.
2. Convert `human_age` to a factor.
3. Convert `human_age` back to a numeric vector using the `as.numeric()` function. Now divide it by 7 to get the original ages back. Explain what happened.

::::::::::::::: solution

## Solution to Challenge 1

1. `human_age <- cats$age * 7`
2. `human_age <- factor(human_age)`. `as.factor(human_age)` works just as well.
3. `as.numeric(human_age)` yields `1 2 3 4 4` because factors are stored as integers (here, 1:4), each of which is associated with a label (here, 28, 35, 56, and 63). Converting the factor to a numeric vector gives us the underlying integers, not the labels. If we want the original numbers, we need to convert `human_age` to a character vector (using `as.character(human_age)`) and then to a numeric vector (why does this work?). This comes up in real life when we accidentally include a character somewhere in a column of a .csv file supposed to only contain numbers, and set `stringsAsFactors=TRUE` when we read in the data.



:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
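
The difference between a factor's integer codes and its labels can be checked directly. The sketch below assumes a hypothetical set of ages chosen so that the labels match those quoted in the solution (28, 35, 56, and 63); it does not use the lesson's `cats` data.

```r
# Hypothetical ages (an assumption, not the lesson's data), scaled to human years
human_age <- factor(c(4, 5, 8, 9, 9) * 7)

# as.numeric() on a factor returns the underlying integer codes
codes <- as.numeric(human_age)

# Going through as.character() first recovers the labels as numbers
labels_as_numbers <- as.numeric(as.character(human_age))

codes
labels_as_numbers
labels_as_numbers / 7
```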

## Removing rows

We now know how to add rows and columns to our data frame in R—but in our
first attempt to add a "tortoiseshell" cat to the data frame we have accidentally
added a garbage row:
We now know how to add rows and columns to our data frame in R. Now let's learn to remove rows.

```{r}
cats
```

We can ask for a data frame minus this offending row:
We can ask for a data frame minus the last row:

```{r}
cats[-4, ]
```

Notice the comma with nothing after it to indicate that we want to drop the entire fourth row.

Note: we could also remove both new rows at once by putting the row numbers
inside of a vector: `cats[c(-4,-5), ]`

Alternatively, we can drop all rows with `NA` values:

```{r}
na.omit(cats)
```

Let's reassign the output to `cats`, so that our changes will be permanent:
Note: we could also remove several rows at once by putting the row numbers
inside of a vector, for example: `cats[c(-3,-4), ]`

```{r}
cats <- na.omit(cats)
```
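
Negative row indices can be tried out on a small data frame. In this sketch, `cats_demo` and its values are hypothetical stand-ins for `cats`.

```r
# Hypothetical four-row data frame standing in for cats
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby", "tortoiseshell"),
  weight = c(2.1, 5.0, 3.2, 3.3)
)

# Drop the fourth row; the empty slot after the comma keeps every column
without_fourth <- cats_demo[-4, ]

# Drop several rows at once with a vector of negative indices
without_last_two <- cats_demo[c(-3, -4), ]

# Drop the last row without hard-coding its number
without_last <- cats_demo[-nrow(cats_demo), ]

# The original is unchanged until a result is assigned back to it
nrow(cats_demo)
```

Indexing returns a new data frame, so a removal only becomes permanent once the result is assigned back to the original name.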

## Removing columns

@@ -215,7 +148,7 @@ cats

::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 2
## Challenge 1

You can create a new data frame right from within R with the following syntax:

@@ -236,7 +169,7 @@ Finally, use `cbind` to add a column with each person's answer to the question,

::::::::::::::: solution

## Solution to Challenge 2
## Solution to Challenge 1

```{r}
df <- data.frame(first = c("Grace"),
@@ -257,7 +190,7 @@ now let's use those skills to digest a more realistic dataset. Let's read in the
`gapminder` dataset that we downloaded previously:

```{r}
gapminder <- read.csv("data/gapminder_data.csv", stringsAsFactors = TRUE)
gapminder <- read.csv("data/gapminder_data.csv")
```

::::::::::::::::::::::::::::::::::::::::: callout
@@ -272,18 +205,20 @@ gapminder <- read.csv("data/gapminder_data.csv", stringsAsFactors = TRUE)

```{r, eval=FALSE, echo=TRUE}
download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
gapminder <- read.csv("data/gapminder_data.csv", stringsAsFactors = TRUE)
gapminder <- read.csv("data/gapminder_data.csv")
```

- Alternatively, you can read files directly into R from the Internet by replacing the file path with a web address in `read.csv`. Note that doing this does not save a local copy of the csv file on your computer. For example,

```{r, eval=FALSE, echo=TRUE}
gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", stringsAsFactors = TRUE)
gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv")
```

- You can read directly from Excel spreadsheets without
  converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package.

- The `stringsAsFactors` argument tells R whether to read strings as factors or as character strings. Since R 4.0.0, strings are read in as characters by default; in earlier versions of R, strings were read in as factors by default. For more information, see the call-out in [the previous episode](https://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html#callout2).


::::::::::::::::::::::::::::::::::::::::::::::::::
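
The effect of `stringsAsFactors` can be seen without any data file by passing CSV text to `read.csv` through its `text` argument. A minimal sketch, in which `csv_text` and its contents are made up for illustration:

```r
# Made-up CSV content, supplied inline instead of from a file
csv_text <- "coat,weight\ncalico,2.1\nblack,5.0\ntabby,3.2"

# Setting the argument explicitly gives the same result in any R version
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$coat)
class(as_fct$coat)
levels(as_fct$coat)
```

Stating the argument explicitly is one way to keep a script's behaviour independent of the version-specific default.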

@@ -294,10 +229,10 @@ out what the data looks like with `str`:
str(gapminder)
```

An additional method for examining the structure of gapminder is to use the `summary` function. This function can be used on various objects in R. For data frames, `summary` yields a numeric, tabular, or descriptive summary of each column. Factor columns are summarized by the number of items in each level, numeric or integer columns by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode.
An additional method for examining the structure of gapminder is to use the `summary` function. This function can be used on various objects in R. For data frames, `summary` yields a numeric, tabular, or descriptive summary of each column. Numeric or integer columns are summarized by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode.

```{r}
summary(gapminder$country)
summary(gapminder)
```
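
How `summary` treats different column types can be previewed on a tiny data frame. This is a sketch with made-up values; `demo` is a hypothetical object and not a subset of gapminder.

```r
# Made-up values for illustration only
demo <- data.frame(
  country = c("A", "B", "C"),
  lifeExp = c(50, 60, 70),
  stringsAsFactors = FALSE
)

# One summary per column: statistics for numeric, length/class/mode for character
summary(demo)

# The same summaries, one column at a time
num_summary <- summary(demo$lifeExp)
chr_summary <- summary(demo$country)
num_summary
chr_summary
```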

Along with the `str` and `summary` functions, we can examine individual columns of the data frame with our `typeof` function:
@@ -361,15 +296,15 @@ head(gapminder)

::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 3
## Challenge 2

It's good practice to also check the last few lines of your data and some in the middle. How would you do this?

Searching for ones specifically in the middle isn't too hard, but we could ask for a few lines at random. How would you code this?

::::::::::::::: solution

## Solution to Challenge 3
## Solution to Challenge 2

Checking the last few lines is relatively simple, as R already has a function for this:

@@ -398,7 +333,7 @@ into a script file so we can come back to it later.

::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 4
## Challenge 3

Go to File -> New File -> R Script, and write an R script
to load in the gapminder dataset. Put it in the `scripts/`
@@ -409,7 +344,7 @@ as its argument (or by pressing the "source" button in RStudio).

::::::::::::::: solution

## Solution to Challenge 4
## Solution to Challenge 3

The `source` function can be used to run one script from within another.
Assume you would like to load the same type of file over and over
@@ -422,7 +357,7 @@ Check out `?source` to find out more.

```{r, eval=FALSE}
download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
gapminder <- read.csv(file = "data/gapminder_data.csv", stringsAsFactors = TRUE)
gapminder <- read.csv(file = "data/gapminder_data.csv")
```

To run the script and load the data into the `gapminder` variable:
@@ -437,21 +372,21 @@ source(file = "scripts/load-gapminder.R")

::::::::::::::::::::::::::::::::::::::: challenge

## Challenge 5
## Challenge 4

Read the output of `str(gapminder)` again;
this time, use what you've learned about factors, lists and vectors,
this time, use what you've learned about lists and vectors,
as well as the output of functions like `colnames` and `dim`
to explain what everything that `str` prints out for gapminder means.
If there are any parts you can't interpret, discuss with your neighbors!

::::::::::::::: solution

## Solution to Challenge 5
## Solution to Challenge 4

The object `gapminder` is a data frame with columns

- `country` and `continent` are factors.
- `country` and `continent` are character vectors.
- `year` is an integer vector.
- `pop`, `lifeExp`, and `gdpPercap` are numeric vectors.
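
The column types listed here can be checked programmatically. The sketch below builds a hypothetical two-row data frame that reuses the gapminder column names with made-up values, so it runs without the data file.

```r
# Hypothetical data frame reusing gapminder's column names (values are made up)
gap_demo <- data.frame(
  country = c("X", "Y"),
  continent = c("P", "Q"),
  year = c(1952L, 1957L),
  pop = c(1e6, 2e6),
  lifeExp = c(50.5, 60.5),
  gdpPercap = c(1000.5, 2000.5),
  stringsAsFactors = FALSE
)

# class() of every column, returned as a named character vector
col_classes <- sapply(gap_demo, class)
col_classes

# typeof() shows the storage type; numeric columns are stored as "double"
col_types <- sapply(gap_demo, typeof)
col_types
```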

@@ -464,8 +399,6 @@ The object `gapminder` is a data frame with columns
- Use `cbind()` to add a new column to a data frame.
- Use `rbind()` to add a new row to a data frame.
- Remove rows from a data frame.
- Use `na.omit()` to remove rows from a data frame with `NA` values.
- Use `levels()` and `as.character()` to explore and manipulate factors.
- Use `str()`, `summary()`, `nrow()`, `ncol()`, `dim()`, `colnames()`, `rownames()`, `head()`, and `typeof()` to understand the structure of a data frame.
- Read in a csv file using `read.csv()`.
- Understand what `length()` of a data frame represents.