swcarpentry · naupaka · May 16, 2023 · May 8, 2023 · May 12, 2023 · May 12, 2023
diff --git a/episodes/05-data-structures-part2.Rmd b/episodes/05-data-structures-part2.Rmd
@@ -8,10 +8,7 @@ source: Rmd
 ::::::::::::::::::::::::::::::::::::::: objectives
 
 - Add and remove rows or columns.
-- Remove rows with `NA` values.
 - Append two data frames.
-- Understand what a `factor` is.
-- Convert a `factor` to a `character` vector and vice versa.
 - Display basic properties of data frames including size and class of the columns, names, and first few rows.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -37,7 +34,7 @@ data are consistent in type throughout the columns. As such, if we want to add a
 new column, we can start by making a new vector:
 
 ```{r, echo=FALSE}
-cats <- read.csv("data/feline-data.csv", stringsAsFactors = TRUE)
+cats <- read.csv("data/feline-data.csv")
 ```
 
 ```{r}
@@ -84,96 +81,32 @@ newRow <- list("tortoiseshell", 3.3, TRUE, 9)
 cats <- rbind(cats, newRow)
 ```
 
-Looks like our attempt to use the `rbind()` function returns a warning.  Recall that, unlike errors, warnings do not necessarily stop a function from performing its intended action.  You can confirm this by taking a look at the `cats` data frame.
+Let's confirm that our new row was added correctly. 
 
 ```{r}
 cats
 ```
 
-Notice that not only did we successfully add a new row, but there is `NA` in the column *coats* where we expected "tortoiseshell" to be.  Why did this happen?
-
-## Factors
-
-For an object containing the data type `factor`, each different value represents what is called a `level`. In our case, the `factor` "coat" has 3 levels: "black", "calico", and "tabby". R will only accept values that match one of the levels. If you add a new value, it will become `NA`.
-
-The warning is telling us that we unsuccessfully added "tortoiseshell" to our
-*coat* factor, but 3.3 (a numeric), TRUE (a logical), and 9 (a numeric) were
-successfully added to *weight*, *likes\_string*, and *age*, respectively, since
-those variables are not factors. To successfully add a cat with a
-"tortoiseshell" *coat*, add "tortoiseshell" as a possible *level* of the factor:
-
-```{r}
-levels(cats$coat)
-levels(cats$coat) <- c(levels(cats$coat), "tortoiseshell")
-cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))
-```
-
-Alternatively, we can change a factor into a character vector; we lose the
-handy categories of the factor, but we can subsequently add any word we want to the
-column without babysitting the factor levels:
-
-```{r}
-str(cats)
-cats$coat <- as.character(cats$coat)
-str(cats)
-```
-
-:::::::::::::::::::::::::::::::::::::::  challenge
-
-## Challenge 1
-
-Let's imagine that 1 cat year is equivalent to 7 human years.
-
-1. Create a vector called `human_age` by multiplying `cats$age` by 7.
-2. Convert `human_age` to a factor.
-3. Convert `human_age` back to a numeric vector using the `as.numeric()` function. Now divide it by 7 to get the original ages back. Explain what happened.
-
-:::::::::::::::  solution
-
-## Solution to Challenge 1
-
-1. `human_age <- cats$age * 7`
-2. `human_age <- factor(human_age)`. `as.factor(human_age)` works just as well.
-3. `as.numeric(human_age)` yields `1 2 3 4 4` because factors are stored as integers (here, 1:4), each of which is associated with a label (here, 28, 35, 56, and 63). Converting the factor to a numeric vector gives us the underlying integers, not the labels. If we want the original numbers, we need to convert `human_age` to a character vector (using `as.character(human_age)`) and then to a numeric vector (why does this work?). This comes up in real life when we accidentally include a character somewhere in a column of a .csv file supposed to only contain numbers, and set `stringsAsFactors=TRUE` when we read in the data.
-
-
-
-:::::::::::::::::::::::::
-
-::::::::::::::::::::::::::::::::::::::::::::::::::
 
 ## Removing rows
 
-We now know how to add rows and columns to our data frame in R—but in our
-first attempt to add a "tortoiseshell" cat to the data frame we have accidentally
-added a garbage row:
+We now know how to add rows and columns to our data frame in R. Next, let's learn how to remove rows.
 
 ```{r}
 cats
 ```
 
-We can ask for a data frame minus this offending row:
+We can ask for a data frame minus the last row:
 
 ```{r}
 cats[-4, ]
 ```
 
 Notice the comma with nothing after it to indicate that we want to drop the entire fourth row.
 
-Note: we could also remove both new rows at once by putting the row numbers
-inside of a vector: `cats[c(-4,-5), ]`
-
-Alternatively, we can drop all rows with `NA` values:
-
-```{r}
-na.omit(cats)
-```
-
-Let's reassign the output to `cats`, so that our changes will be permanent:
+Note: we could also remove several rows at once by putting the row numbers
+inside of a vector, for example: `cats[c(-3,-4), ]`
 
-```{r}
-cats <- na.omit(cats)
-```
 
 ## Removing columns
 
@@ -215,7 +148,7 @@ cats
 
 :::::::::::::::::::::::::::::::::::::::  challenge
 
-## Challenge 2
+## Challenge 1
 
 You can create a new data frame right from within R with the following syntax:
 
@@ -236,7 +169,7 @@ Finally, use `cbind` to add a column with each person's answer to the question,
 
 :::::::::::::::  solution
 
-## Solution to Challenge 2
+## Solution to Challenge 1
 
 ```{r}
 df <- data.frame(first = c("Grace"),
@@ -257,7 +190,7 @@ now let's use those skills to digest a more realistic dataset. Let's read in the
 `gapminder` dataset that we downloaded previously:
 
 ```{r}
-gapminder <- read.csv("data/gapminder_data.csv", stringsAsFactors = TRUE)
+gapminder <- read.csv("data/gapminder_data.csv")
 ```
 
 :::::::::::::::::::::::::::::::::::::::::  callout
@@ -272,18 +205,20 @@ gapminder <- read.csv("data/gapminder_data.csv", stringsAsFactors = TRUE)
 
 ```{r, eval=FALSE, echo=TRUE}
 download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
-gapminder <- read.csv("data/gapminder_data.csv", stringsAsFactors = TRUE)
+gapminder <- read.csv("data/gapminder_data.csv")
 ```
 
 - Alternatively, you can also read in files directly into R from the Internet by replacing the file paths with a web address in `read.csv`. One should note that in doing this no local copy of the csv file is first saved onto your computer. For example,
 
 ```{r, eval=FALSE, echo=TRUE}
-gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", stringsAsFactors = TRUE)
+gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv")
 ```
 
 - You can read directly from excel spreadsheets without
   converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package.
 
+- The `stringsAsFactors` argument tells R whether to read strings as factors or as character vectors. Since R 4.0.0, strings are read in as characters by default; in earlier versions of R, they are read in as factors by default. For more information, see the callout in [the previous episode](https://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html#callout2).
+
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
 
@@ -294,10 +229,10 @@ out what the data looks like with `str`:
 str(gapminder)
 ```
 
-An additional method for examining the structure of gapminder is to use the `summary` function. This function can be used on various objects in R. For data frames, `summary` yields a numeric, tabular, or descriptive summary of each column. Factor columns are summarized by the number of items in each level, numeric or integer columns by the descriptive statistics (quartiles and mean), and character columns by its length, class, and mode.
+An additional method for examining the structure of gapminder is to use the `summary` function. This function can be used on various objects in R. For data frames, `summary` yields a numeric, tabular, or descriptive summary of each column. Numeric or integer columns are summarized by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode.
 
 ```{r}
-summary(gapminder$country)
+summary(gapminder)
 ```
 
 Along with the `str` and `summary` functions, we can examine individual columns of the data frame with our `typeof` function:
@@ -361,15 +296,15 @@ head(gapminder)
 
 :::::::::::::::::::::::::::::::::::::::  challenge
 
-## Challenge 3
+## Challenge 2
 
 It's good practice to also check the last few lines of your data and some in the middle. How would you do this?
 
 Searching for ones specifically in the middle isn't too hard, but we could ask for a few lines at random. How would you code this?
 
 :::::::::::::::  solution
 
-## Solution to Challenge 3
+## Solution to Challenge 2
 
 To check the last few lines it's relatively simple as R already has a function for this:
 
@@ -398,7 +333,7 @@ into a script file so we can come back to it later.
 
 :::::::::::::::::::::::::::::::::::::::  challenge
 
-## Challenge 4
+## Challenge 3
 
 Go to file -> new file -> R script, and write an R script
 to load in the gapminder dataset. Put it in the `scripts/`
@@ -409,7 +344,7 @@ as its argument (or by pressing the "source" button in RStudio).
 
 :::::::::::::::  solution
 
-## Solution to Challenge 4
+## Solution to Challenge 3
 
 The `source` function can be used to use a script within a script.
 Assume you would like to load the same type of file over and over
@@ -422,7 +357,7 @@ Check out `?source` to find out more.
 
 ```{r, eval=FALSE}
 download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
-gapminder <- read.csv(file = "data/gapminder_data.csv", stringsAsFactors = TRUE)
+gapminder <- read.csv(file = "data/gapminder_data.csv")
 ```
 
 To run the script and load the data into the `gapminder` variable:
@@ -437,21 +372,21 @@ source(file = "scripts/load-gapminder.R")
 
 :::::::::::::::::::::::::::::::::::::::  challenge
 
-## Challenge 5
+## Challenge 4
 
 Read the output of `str(gapminder)` again;
-this time, use what you've learned about factors, lists and vectors,
+this time, use what you've learned about lists and vectors,
 as well as the output of functions like `colnames` and `dim`
 to explain what everything that `str` prints out for gapminder means.
 If there are any parts you can't interpret, discuss with your neighbors!
 
 :::::::::::::::  solution
 
-## Solution to Challenge 5
+## Solution to Challenge 4
 
 The object `gapminder` is a data frame with columns
 
-- `country` and `continent` are factors.
+- `country` and `continent` are character vectors.
 - `year` is an integer vector.
 - `pop`, `lifeExp`, and `gdpPercap` are numeric vectors.
 
@@ -464,8 +399,6 @@ The object `gapminder` is a data frame with columns
 - Use `cbind()` to add a new column to a data frame.
 - Use `rbind()` to add a new row to a data frame.
 - Remove rows from a data frame.
-- Use `na.omit()` to remove rows from a data frame with `NA` values.
-- Use `levels()` and `as.character()` to explore and manipulate factors.
 - Use `str()`, `summary()`, `nrow()`, `ncol()`, `dim()`, `colnames()`, `rownames()`, `head()`, and `typeof()` to understand the structure of a data frame.
 - Read in a csv file using `read.csv()`.
 - Understand what `length()` of a data frame represents.
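
Review note: the behavior this change relies on can be checked with a small sketch. It is not part of the patch. The inline CSV text is an illustrative stand-in for `data/feline-data.csv` (its values are assumptions), and `cats_chr`, `cats_fct`, and `cats_minus_last` are hypothetical names used only here. `stringsAsFactors` is passed explicitly so the result does not depend on the R version's default.

```r
# Illustrative stand-in for data/feline-data.csv; the values are assumptions.
csv_text <- "coat,weight,likes_string,age
calico,2.1,TRUE,2
black,5.0,FALSE,3
tabby,3.2,TRUE,5"

# Read the same text twice, once per setting of stringsAsFactors.
cats_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
cats_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(cats_chr$coat)  # "character"
class(cats_fct$coat)  # "factor"

# With a character column, a row with a previously unseen coat
# can be appended directly, as in the revised episode text.
cats_chr <- rbind(cats_chr, list("tortoiseshell", 3.3, TRUE, 9))

# A negative row index returns a copy without that row;
# cats_chr itself still has four rows.
cats_minus_last <- cats_chr[-4, ]
```

Because `coat` is a character vector in `cats_chr`, the appended row keeps its "tortoiseshell" value, which is why the revised text no longer needs the factor-level workaround.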