Weird mean() error in beginning of the dplyr episode #558

Open
rkclement opened this issue Aug 29, 2019 · 1 comment

@rkclement
commented Aug 29, 2019

I have taken to showing students read_csv() instead of read.csv() when we start working with the gapminder data. Today I ran into a strange error and want to know if others have seen this and have any idea how old it is (I think this code worked back in February 2019).

If I read in the gapminder data with:
gapminder <- read_csv("data/gapminder_data.csv")

When I get to the beginning of the dplyr lesson and try to run:
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])

I get NA back, along with the following warning:
Warning message:
In mean.default(gapminder[gapminder$continent == "Africa", "gdpPercap"]) :
  argument is not numeric or logical: returning NA

There are no NAs in the data, as confirmed by:
sum(is.na(gapminder[gapminder$continent == "Africa", "gdpPercap"]))

Interestingly, if I read the data with read.csv() the problem disappears:
gapminder <- read.csv("data/gapminder_data.csv")
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])

Also, if I just do the following (i.e. without the square brackets for subsetting), the problem also disappears:
gapminder <- read_csv("data/gapminder_data.csv")
mean(gapminder$gdpPercap)

Finally, though this seems to be related to reading in the data as a tibble, the following also does not work:
gapminder <- read_csv("data/gapminder_data.csv")
mean(as.data.frame(gapminder[gapminder$continent == "Africa", "gdpPercap"]))

As best I can tell, this happens because subsetting a tibble with brackets gives you another tibble, whereas selecting a single column of a data frame with brackets gives you a vector. However, I can't even use as.vector() to fix this, i.e., this doesn't work:
as.vector(gapminder[gapminder$continent == "Africa", "gdpPercap"], mode = 'numeric')

Is there any way around this issue, or can you just not effectively use brackets to subset tibbles?

@fmichonneau

Member

commented Sep 6, 2019

hi @rkclement!

Your diagnosis is correct: you can't compute the mean of a data frame object:

mean(iris)
# [1] NA
# Warning message:
# In mean.default(iris) : argument is not numeric or logical: returning NA

If you import gapminder using read.csv, the gapminder object is of class data.frame:

gapminder <- read.csv("data/gapminder_data.csv")
class(gapminder)
# [1] "data.frame"

When you evaluate gapminder[gapminder$continent == "Africa", "gdpPercap"], R coerces the result to a vector. The [ function has an easily overlooked argument called drop (TRUE by default) that automatically converts a 1-column data frame into an atomic vector.

class(gapminder[gapminder$continent == "Africa", "gdpPercap"])
# [1] "numeric"
class(gapminder[gapminder$continent == "Africa", "gdpPercap", drop=FALSE])
# [1] "data.frame"
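
To see the effect of drop without needing the gapminder file, here is a minimal base R sketch. The toy data frame and its values are made up for illustration; only the column names mirror gapminder.

```r
# Toy stand-in for the gapminder data (values are invented for illustration)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia"),
  gdpPercap = c(1000, 3000, 5000)
)

africa <- toy$continent == "Africa"

# Default drop = TRUE: a single selected column collapses to a plain vector
dropped <- toy[africa, "gdpPercap"]
is.numeric(dropped)   # TRUE
mean(dropped)         # 2000

# drop = FALSE keeps a one-column data frame, which mean() cannot handle
kept <- toy[africa, "gdpPercap", drop = FALSE]
is.data.frame(kept)   # TRUE
is.numeric(kept)      # FALSE
```

A data frame is a list of columns, so is.numeric() is FALSE for it even when every column is numeric, which is why mean() gives up and returns NA.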

The tidyverse is designed to make data frames the data structure of choice and strives to limit surprises caused by coercion. Tibbles therefore never do this type of coercion by default.

gapminder_ti <- read_csv("data/gapminder_data.csv")
class(gapminder_ti[gapminder_ti$continent == "Africa", "gdpPercap"])
# [1] "tbl_df"     "tbl"        "data.frame" # this is the class of a tibble
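
The same behaviour can be reproduced on a small hand-built tibble. This is only a sketch: it assumes the tibble package is installed, and the toy values are invented. It also shows why wrapping the result in as.data.frame() does not help: the class changes, but the object is still a one-column data frame rather than a vector.

```r
library(tibble)

# Toy stand-in for gapminder_ti (values are invented for illustration)
toy_ti <- tibble(
  continent = c("Africa", "Africa", "Asia"),
  gdpPercap = c(1000, 3000, 5000)
)
africa <- toy_ti$continent == "Africa"

# Single-bracket subsetting keeps the tibble, even with one column selected
sub <- toy_ti[africa, "gdpPercap"]
is_tibble(sub)                  # TRUE
is.numeric(sub)                 # FALSE

# as.data.frame() only changes the class; it is still not a vector
is.numeric(as.data.frame(sub))  # FALSE

# $ and [[ return the column itself, which mean() accepts
is.numeric(sub$gdpPercap)       # TRUE
mean(sub[["gdpPercap"]])        # 2000
```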

How can you calculate the mean then?

  1. you can use drop = TRUE:
mean(gapminder_ti[gapminder_ti$continent == "Africa", "gdpPercap", drop = TRUE])
# [1] 2193.755
  2. you can extract the column as a vector with [[ and then subset it:
mean(gapminder_ti[["gdpPercap"]][gapminder_ti$continent == "Africa"])
# [1] 2193.755
  3. you can use the tidyverse functions:
library(dplyr)
gapminder_ti %>%
  filter(continent == "Africa") %>%
  pull(gdpPercap) %>%
  mean()
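
If the mean is wanted for every continent rather than just one, a related dplyr pattern is to summarise by group, which keeps the result as a tibble instead of pulling out a vector. This is a sketch on invented toy data, not part of the lesson as quoted above; the names toy_ti, by_continent and mean_gdpPercap are hypothetical.

```r
library(dplyr)

# Toy stand-in for gapminder_ti (values are invented for illustration)
toy_ti <- tibble::tibble(
  continent = c("Africa", "Africa", "Asia"),
  gdpPercap = c(1000, 3000, 5000)
)

# One row per continent, with the mean computed inside summarise()
by_continent <- toy_ti %>%
  group_by(continent) %>%
  summarise(mean_gdpPercap = mean(gdpPercap))

by_continent
```

Because mean() is called on the column inside summarise(), it always receives a plain numeric vector, so the coercion question never comes up.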