
Exploring Data Frames: order of objectives #356

Open
eageissinger opened this Issue Mar 14, 2018 · 7 comments

@eageissinger

eageissinger commented Mar 14, 2018

I would like to propose rearranging the order of the material in Exploring Data Frames. I think finding the basic properties of the data frames should come before the rest of the objectives. It is important to review the properties of your data frame before any manipulations take place. Therefore, I think learning str(), head(), tail(), dim(), etc., should come at the beginning of the lesson, followed by the data manipulation (adding/removing rows, changing from factor to character, etc.).
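For illustration, the inspection functions mentioned above could open the episode with something like this (the cats values below are a stand-in chosen for this sketch; the column names follow the lesson's cats data):

```r
# a stand-in for the lesson's cats data frame (values are illustrative)
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

str(cats)          # column types plus a preview of the values
dim(cats)          # rows and columns: 3 3
nrow(cats)         # 3
ncol(cats)         # 3
head(cats, n = 2)  # first two rows
tail(cats, n = 2)  # last two rows
```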

@eageissinger eageissinger changed the title from Exploring Data Frames: order of objectives/material within lesson to Exploring Data Frames: order of objectives Mar 14, 2018

@naupaka

Member

naupaka commented Mar 21, 2018

That seems reasonable to me, but I'd be curious whether others have strong feelings on it, since it's a relatively large change.

@mawds

Collaborator

mawds commented Mar 23, 2018

I think it's a good point, but it would mean we'd jump from cats (at the end of the previous episode) to gapminder (for properties of data frames) and then back to cats (for data manipulation). Personally I'm not keen on switching between data sets if it can be avoided, as I think it can cause confusion.

We couldn't use the cats data for the properties of data frames, since it's too small for head() and tail() to be useful. If we could live with dropping those from the episode, we could use the cats data for everything.

(Though that would mean the first use/loading of gapminder would need to move to another episode. Or perhaps load it at the end of this episode and use it for some properties-of-data-frames exercises, mentioning head()/tail() there if necessary?)

@naupaka

Member

naupaka commented Mar 23, 2018

Is there a reason we couldn't switch out the cats data for a subset of the gapminder data? Three or four rows and three or four columns, spanning a few countries and maybe two continents? Then we could have the proper ordering with no loss of cohesion. I have felt that the cats are a little bit random amidst the rest of it.

@eageissinger

eageissinger commented Mar 23, 2018

I agree that switching between data sets would be confusing (I didn't think of that when I first proposed the change), but Naupaka brings up a good solution with using a subset of the gapminder data. It could definitely allow for more fluidity as we move from one episode to the next.

I am new to the lesson material and have only seen it in action once, so I acknowledge that this might not be worth the effort of changing. However, I did notice that a lot of people taking the course enjoy working with the gapminder data set more than with cats, because it has more real-world application.

@emielvanloon

emielvanloon commented Mar 28, 2018

As mentioned above: when dropping the cats data in this section ('Exploring Data Frames'), the cats example in the preceding section ('Data Structures') becomes a bit isolated, and will appear even more contrived.

Still, it is nice for didactic reasons to have a small example data set entered manually and used to explain various aspects of/operations on data structures.
I think it could be elegant to enter a tiny set with life expectancy & population data for four countries by hand.

A nice choice might be:

life1972 <- data.frame(country = c('Chad', 'Cuba', 'Japan', 'Nepal'),
                       lifeExp = c(45.6, 70.7, 73.4, 44.0),
                       pop     = c(3.9, 8.8, 107.2, 12.4))

[With population size in millions. Apart from rounding, these data correspond to the gapminder data.]

This minimalistic set of countries is interesting because of contrasting population developments, but also because of some (potentially useful) categorical differences: two islands, two land-locked countries, and locations in three continents.

Later on, a vector with the continents could be added, country could be changed to character, pop could be multiplied by 10^6, etc.
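Those manipulations could be sketched roughly as follows (assuming the life1972 data frame proposed above; the continent values are taken from the gapminder data):

```r
life1972 <- data.frame(country = c('Chad', 'Cuba', 'Japan', 'Nepal'),
                       lifeExp = c(45.6, 70.7, 73.4, 44.0),
                       pop     = c(3.9, 8.8, 107.2, 12.4))

# add a continent column
life1972$continent <- c('Africa', 'Americas', 'Asia', 'Asia')

# change country from factor to character
# (in R < 4.0 data.frame() made it a factor by default;
#  in newer R it is already character and this is a no-op)
life1972$country <- as.character(life1972$country)

# convert population from millions to individuals
life1972$pop <- life1972$pop * 10^6
```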

Data for the same variables, but for a different year, could be loaded from a file:

life2002 <- read.csv('data/life2002.csv')

After this it would be natural to add a variable with years to both data frames using cbind(), and to combine the two data frames with rbind() [etc.]
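The cbind()/rbind() step could look like this (a sketch, assuming both data frames are defined as above):

```r
life1972 <- data.frame(country = c('Chad', 'Cuba', 'Japan', 'Nepal'),
                       lifeExp = c(45.6, 70.7, 73.4, 44.0),
                       pop     = c(3.9, 8.8, 107.2, 12.4))
life2002 <- data.frame(country = c('Chad', 'Cuba', 'Japan', 'Nepal'),
                       lifeExp = c(50.5, 77.2, 82.0, 61.3),
                       pop     = c(8.8, 11.2, 127.1, 25.9))

# add a year column to each frame with cbind()
# (the single value is recycled for every row)
life1972 <- cbind(life1972, year = 1972)
life2002 <- cbind(life2002, year = 2002)

# stack the two frames with rbind(); column names must match
life <- rbind(life1972, life2002)
nrow(life)  # 8
```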

In the section ('Exploring Data Frames'), the step from the small data set to the gapminder data would then be quite natural.

One could state that this type of demographic and economic data is in fact already available for every country in the world ...

# life2002 would be generated with:

life2002 <- data.frame(country = c('Chad', 'Cuba', 'Japan', 'Nepal'),
                       lifeExp = c(50.5, 77.2, 82.0, 61.3),
                       pop     = c(8.8, 11.2, 127.1, 25.9))
write.csv(x = life2002, file = "data/life2002.csv", row.names = FALSE)

# the life1972 and life2002 data can be extracted from the gapminder data by:

cnt <- gapminder$country %in% c('Chad', 'Cuba', 'Japan', 'Nepal')
yr  <- gapminder$year %in% c(1972, 2002)
gapminder[cnt & yr, ]

@naupaka

Member

naupaka commented Apr 2, 2018

@emielvanloon I like this approach. It seems much more conceptually coherent to work with the same dataset throughout if at all possible. And it seems like this change wouldn't require too much reworking of the existing content; it's just swapping out the data and changing the wording to reflect it.

@jcoliver

Collaborator

jcoliver commented Apr 3, 2018

I agree. It would be nice if we could maintain the "fun" with factors (by reading data from a file via read.csv), since the inevitability (and utility) of factors is important.
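A minimal illustration of that point (the file contents are made up; stringsAsFactors is set explicitly here because R 4.0 changed the default from TRUE to FALSE):

```r
# write a tiny csv to a temporary file, then read it back
f <- tempfile(fileext = ".csv")
write.csv(data.frame(country = c("Chad", "Cuba"), pop = c(3.9, 8.8)),
          file = f, row.names = FALSE)

# before R 4.0 this was the default behaviour of read.csv
d <- read.csv(f, stringsAsFactors = TRUE)
class(d$country)  # "factor"
levels(d$country)
```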
