This repository has been archived by the owner on Jan 3, 2018. It is now read-only.

Edits to make novice R lesson 01 more similar to novice Python lesson 01. #639

Merged
merged 10 commits on Sep 2, 2014

Conversation

jdblischak
Contributor

I have made edits so that the first R lesson more closely follows the Python lesson. It covers all the same topics save two:

  • Loading external packages - omitted because none of the operations in the lesson require loading an external package.
  • How to display three plots side-by-side - omitted because this is more complicated in R. This would require discussing how to set and reset par options, which is not a beginner topic.

Here's how you can quickly compare the two rendered files:

# clone bc repo and enter it
git clone git@github.com:swcarpentry/bc.git
cd bc
# create and switch to new branch
git checkout -b review-pr
# add remote
git remote add jdb git@github.com:jdblischak/bc.git
# pull changes from my branch
git pull jdb 01-starting-with-data
# build site locally
make site

Then compare the rendered files _site/novice/r/01-starting-with-data.html and _site/novice/python/01-numpy.html.

@jdblischak added the R label Aug 1, 2014
* Can also create with `data.frame()` function.
* Find the number of rows and columns with `nrow(dat)` and `ncol(dat)`, respectively.
* Rownames are usually 1, 2, ..., n.
<pre class='in'><code>dat <- as.matrix(dat)

Curiously, the Github md renderer is hiding the dat <- as.matrix line in the .md preview of this line, so I thought something was wrong. Looks okay here though -- must be due to the presence of the HTML tags. Not obvious why your .md output is not using md code blocks (like ~~~) here though?

Contributor Author

This is so that the formatting of the R lessons is identical to that of the Python lessons. I accomplished this using knitr hooks as you suggested during the sprint (see this comment).


Cool, though an extra newline after the <code> and before x would probably fix the Github rendering (unless that's for consistency's sake too).

Hmm, can you link me to an example Python lesson though? I missed the part about switching to using <pre> html tags in the output. It seems non-ideal to me to have this chimeric markdown-html mix for code blocks (even though I suppose the .md is just an intermediate form at this point that no one cares about). Using ~~~r you have a syntax that just about every markdown parser knows about (your Jekyll flavor: kramdown, pandoc, and Github flavor, anyhow), which should result in nice pretty syntax highlighting on Jekyll presuming you've set it up. Using explicit html "pre" blocks you have a far more ambiguous format that doesn't seem to indicate the language used.

Contributor Author

Cool, though an extra newline after the <code> and before x would probably fix the Github rendering (unless that's for consistency's sake too).

The extra newline creates a blank line in all the code blocks.

The Python lessons are written in IPython Notebooks and then converted to Markdown using a custom script. All the novice Python lessons are in the bc repo in novice/python. And you can view the final rendered version from the SWC website (Python lessons).

You are correct that using the "pre" blocks loses the syntax highlighting. For background context on how we ended up with the current system for building the lessons, you should start at Issue #349. If you have ideas on how the entire build process could be improved, I'd recommend sending an email to the discuss mailing list.

@cboettig

cboettig commented Aug 1, 2014

@jdblischak Mostly looks good to me, though as I'm not familiar with the intro python lessons you might want to ignore my comments...

  • You mention classes data.frame and matrix without giving much of a sense of what these object classes are or how they differ (which is, I believe, a frequent source of confusion). Personally I would not introduce the matrix object at this time, and teach people to always work with data.frames until there is a compelling reason not to (computational efficiency blah; that's a job for better internals, see dplyr package if that's a concern). I would likewise focus on subsetting data.frames. Perhaps the matrix class is a more natural analogue to the python lesson, and it certainly is to other common languages, but I don't think that's a compelling reason to focus on matrices over data.frames in R.
  • I'd spend a moment to explain what an R function is (not how to write one, but the fact that arguments are named, that you must give them in the order named or better specify them as name-value pairs).
  • In particular, I find jumping into an apply function without some understanding of this is rather advanced, particularly as the apply syntax is not nearly as intuitive / semantic as most function calls.
  • I would spend this time getting users to think about having well structured data instead, and then how to work with functions across that data in a meaningful way. The current example is very abstract -- the data is a matrix of numbers, and the apply examples just calculate arbitrary statistics across arbitrary dimensions. I feel it would be much more useful to get users to think first about how their data is structured, and to get it structured in a logical way: variables as columns, with column names, observations as rows. A user is then unlikely to need to do something like blindly taking the mean across all the rows or all the columns, but rather apply certain functions to certain variables. Code that makes use of column names explicitly is much easier to understand than code that refers to "dimension 2". But that could be just me.
  • A heatmap feels like an un-intuitively complicated function for a first plot, but what do I know.
  • Maybe this is crazy, but I wonder if it would be worth teaching certain idioms from the hadleyverse from the beginning? Most of us didn't learn that way (guess these things didn't exist -- I'm still rewiring my brain), and there's certainly a downside to learning a specialized subset of R instead of base R functions. Nevertheless, some of the use cases are very compelling, and I can only regret the hours and hours I spent learning all about multipanel plots with obscure par options that I will never use again having adopted ggplot...
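The named-arguments point can be sketched in a few lines (using `seq()` purely as a stand-in example; it is not a function from the lesson):

```r
# R matches arguments by position or by name; named arguments may be reordered.
x1 <- seq(from = 1, to = 9, by = 2)   # fully named
x2 <- seq(1, 9, 2)                    # positional, same result
x3 <- seq(by = 2, to = 9, from = 1)   # named, in any order
identical(x1, x2) && identical(x1, x3)  # TRUE
```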

Overall very nice tutorial though! Hope I didn't get too bogged down in the subjective and the minutiae

@jdblischak
Contributor Author

Thank you very much for the thorough review, @cboettig. All your suggestions are extremely reasonable. Most of the points you raise, e.g. matrix versus data frame, were choices I made to keep the lesson in line with the novice Python lesson. To give you some context, back in March all the SWC instructors interested in developing R lessons met to discuss our strategy. You can read the full summary, but the tl;dr version is that we chose to mirror the Python lessons instead of creating new R-specific lessons. This obviously has its pros and cons. The main pro is that it keeps the focus of the lessons on good programming practices in general, which can also be seen as its greatest con since it largely ignores wonderful tools like those in the Hadleyverse.

So the goal of this set of lessons is not so much "Learn how to use R most effectively", it is "Teach novices how to program a computer to analyze data (using R)." The hope is that these best practices, e.g. modularity, don't repeat yourself, defensive programming, etc. will translate no matter what language they end up using.

@cboettig

cboettig commented Aug 1, 2014

@jdblischak that makes perfect sense, thanks for the context. Of the options you list, I agree that translating lessons that you know work makes more sense than the alternatives, as does emphasizing such language-agnostic practices.

I think there is still an open question as to what is considered a translation though. A matrix is the 'base-class' of matlab, but I would argue that it is not the 'base class' of R, and that the translation would be more complete by introducing the language's workhorse data structure rather than the one that best corresponds to the data structure in python. (After all, it's because you are most concerned with language agnostic concepts that you shouldn't care if the translation is the most bitwise equivalent to the python lesson, but rather that it is the best expression of those bigger concepts in whatever language it is expressed. To go to the logical extreme, no one would suggest that calling python from R was the best 'translation' into R, though it is clearly the most consistent).

I'm not sure that I would suggest using the Hadleyverse tools either, but that is probably more a question of course content than of language. Teaching that data headings matter, that code should be self documenting rather than referring to numeric column indices, etc, may just not be part of the syllabus at this stage, so I see why you would ignore it.

Nevertheless, I don't think using either such practices or explicit tools from the Hadleyverse is inconsistent with the lesson goals. I'm not experienced enough to know which vocabulary (e.g. base R or Hadley's) is most effective at teaching students those bigger goals. Perhaps it makes no difference.

@jdblischak
Contributor Author

You mention classes data.frame and matrix without giving much of a sense of what these object classes are or how they differ (which is, I believe, a frequent source of confusion).

This is by design. We have found that listing all the data types and data structures does not work well with complete novices. They have no context for why these details are important. This is why these novice lessons were originally designed (in Python) to be data-driven. We want to demonstrate why writing code to automate your analysis is such a better way to do science than using spreadsheet software. Observing how to perform a data analysis will give them the context to better understand data types and structures when it comes time to learn more advanced material.

Personally I would not introduce the matrix object at this time, and teach people to always work with data.frames until there is a compelling reason not to...

Converting the data to a matrix allows us to perform max, min, mean, and sd on the entire dataset. Also it allows us to use the heatmap function.

I would likewise focus on subsetting data.frames. Perhaps the matrix class is a more natural analogue to the python lesson, and it certainly is to other common languages, but I don't think that's a compelling reason to focus on matrices over data.frames in R.

But the example used is numeric data. Thus I think it does make sense to use matrix. The only reason it was a data frame is because that is what is returned by read.table, which is a lot simpler than matrix(scan(.... I totally agree that the data frame is the most important data structure in R, but the data set that was designed for these novice lessons does not facilitate using data frames. For example, subsetting with integers is exactly the same for a matrix and a data frame, so this will transfer well when they do use a data frame.
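That subsetting claim can be sketched with a made-up two-by-two data frame standing in for the inflammation data:

```r
# Integer subsetting reads the same for a data frame and for a matrix.
df  <- data.frame(V1 = c(0, 1), V2 = c(3, 5))
mat <- as.matrix(df)
df[1, 2] == mat[1, 2]             # TRUE: same single element
all(unlist(df[2, ]) == mat[2, ])  # TRUE: same whole row
```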

I'd spend a moment to explain what an R function is (not how to write one, but the fact that arguments are named, that you must give them in the order named or better specify them as name-value pairs).

Good point. I'll think about a way to discuss this more in depth after introducing read.table.

I would spend this time getting users to think about having well structured data instead, and then how to work with functions across that data in a meaningful way. The current example is very abstract...

The data is supposed to be 60 patients as the rows and 40 days as the columns. When discussing apply for example, we connect taking the mean along "dimension 1" with taking the mean inflammation value per patient. And similarly, "dimension 2" is days. This is explained at the beginning of the lesson. Perhaps we could reiterate what the data is later in the lesson when it is being read in?
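For instance, with a toy 3-patient by 4-day matrix standing in for the real 60 x 40 data:

```r
# Rows are patients, columns are days (made-up numbers).
dat <- matrix(c(0, 1, 2, 3,
                1, 2, 3, 4,
                2, 3, 4, 5), nrow = 3, byrow = TRUE)
apply(dat, 1, mean)  # MARGIN = 1: mean per patient -> 1.5 2.5 3.5
apply(dat, 2, mean)  # MARGIN = 2: mean per day     -> 1 2 3 4
```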

A heatmap feels like an un-intuitively complicated function for a first plot, but what do I know.

Totally agree. This is for the sake of consistency. Visualizing data as a heatmap returns in a later lesson, so I couldn't remove this.

Maybe this is crazy, but I wonder if it would be worth teaching certain idioms from the hadleyverse from the beginning?

I agree that using tools from the Hadleyverse does enhance the R experience. For now our plan is to use mainly base R in the main set of lessons, and then have supplementary lessons on useful R libraries, e.g. see PR #621 on ggplot2.

Overall very nice tutorial though! Hope I didn't get too bogged down in the subjective and the minutiae

Thanks! This lesson has been the result of the cumulative effort of multiple dedicated SWC instructors. We appreciate the feedback.

@cboettig

cboettig commented Aug 1, 2014

Thanks for the detailed replies.

You clearly have a well developed and well tested lesson here, and I'm glad to learn from your experiences.

Just for the sake of clarity, I do not think being numeric justifies the use of a matrix; there's nothing wrong with a data frame in which all columns are numeric, which is why read.table returns a data frame in the first place. (Besides noting that a date class may be more appropriate.) I agree that you don't want to get into all the object classes here, so it would be better to avoid the examples that explicitly refer to changing the class, right? (At least perhaps a disclaimer that you will explain later why you've switched from data.frame to matrix and what the differences are.)

Right, I realize the data in the example doesn't have variables as columns, but treats column ids as an implicit variable and row ids as an implicit variable. While it has no impact on learning programming skills per se, I believe students would be better served in the long run if they weren't taught bad habits in working with data (I realize that this is probably an issue in the python lesson as well, so maybe it should be left for another time).


Carl Boettiger
UC Santa Cruz
http://carlboettiger.info/

@jdblischak
Contributor Author

OK. I added a few more notes on how to call a function. There are more details on default parameters and when you can or cannot pass an object without naming it in the next lesson, so I didn't want to delve into too much detail.

I do not think being numeric justifies the use of a matrix; there's nothing wrong with a data frame in which all columns are numeric, which is why read.table returns a data frame in the first place.

I believe students would be better served in the long run if they weren't taught bad habits in working with data (I realize that's this is probably an issue in the python lesson as well, so maybe it should be left for another time).

I understand your concerns, but I don't see how I can appease them without creating a whole new lesson. Another way to represent the data is a data frame with the following columns:

day patient inflammation
1 1 10
1 2 5
1 3 0
1 4 16
... ... ...

But then the sections on slicing rows and columns, calculating summary statistics along the rows or columns, plotting a heatmap, etc. would all have to be completely different.

And if we leave the data in the same dimensions but as a data frame, then the calls to heatmap, mean, and sd would all have to be changed to heatmap(as.matrix(dat)), mean(as.matrix(dat)), and sd(as.matrix(dat)). I prefer my solution of converting it to a matrix instead of using nested function calls.
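The two candidate styles, side by side on a toy data frame (the real `dat` is the inflammation data):

```r
dat <- data.frame(d1 = c(1, 2), d2 = c(3, 4))
# nested call: convert at every use
mean(as.matrix(dat))   # 2.5
# convert once, then reuse the matrix
mat <- as.matrix(dat)
mean(mat)              # 2.5
sd(mat)                # same as sd(c(1, 2, 3, 4))
```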

@cboettig

cboettig commented Aug 7, 2014

Okay, sounds good to me.



@jdblischak
Contributor Author

@sarahsupp, @dbarneche, @gavinsimpson, @naupaka, and @jainsley. I'd be interested in your feedback.

```{r}
read.csv("data/inflammation-01.csv", header = FALSE)
```{r, results="hide"}
read.table(file = "inflammation-01.csv", sep = ",")
Contributor

Why did you change read.csv() to read.table()?
They both return a dataframe, but it seems to me that read.csv is less of a mental leap since they are already working with csv files.

For other delimited data, Hadley recommends read.delim here

Contributor

I also used read.csv() in the reference sheet (#663), since it is in the current material.

Contributor Author

The three options are:

  1. read.csv(file = "inflammation-01.csv", header = FALSE)
  2. read.table(file = "inflammation-01.csv", sep = ",")
  3. read.delim(file = "inflammation-01.csv", header = FALSE, sep = ",")

The use of either read.table or read.csv requires specifying one parameter in addition to the filename. Since the file does not have a header, I think it makes more sense to focus on the delimiter than the non-existent header. This also avoids having to talk about Boolean values right at the beginning.
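A quick check that the three spellings are interchangeable, using a small temporary file in place of the lesson's inflammation-01.csv:

```r
# write a tiny stand-in CSV with no header
tmp <- tempfile(fileext = ".csv")
writeLines(c("0,0,1", "0,1,2"), tmp)
a <- read.csv(file = tmp, header = FALSE)
b <- read.table(file = tmp, sep = ",")           # header = FALSE is read.table's default
d <- read.delim(file = tmp, header = FALSE, sep = ",")
identical(a, b) && identical(a, d)               # TRUE: all three return the same data frame
```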

Contributor

This also avoids having to talk about Boolean values right at the beginning.

Makes sense.

Would you plan to later on show the read.table() documentation and show them that there are wrappers for common file formats?

Contributor

I disagree @jdblischak. The file is a CSV file and read.csv() is the natural choice here. I think one can explain header = FALSE with (something along the lines of) "there are no column labels, or header, hence we use... We'll explain more about FALSE and TRUE later."?

Contributor

I agree with the use of read.csv() in this case. In my opinion, understanding TRUE and FALSE is much more intuitive than understanding read.csv() vs. read.delim() vs. read.table(). It can be explained during teaching that we have a CSV file and R has a great built-in function for importing CSV files.

But, this is all really nitpicky stuff. Overall, I think the lesson looks great.

Contributor

Yup, I agree, but the point of reviews is to comment on the broad things as well as the minor things. @jdblischak made some great changes but also chose to alter something which I thought worked just fine before.

@gavinsimpson
Contributor

I'm going to push back on the use of a matrix here. Focussing on it just because we can run min(), max(), etc. on the matrix seems a weak argument to me. I echo @cboettig's comments here; users are going to encounter data frames much more frequently than matrices.

This is one of those tensions between Python and R; following the Python lessons too closely risks clumsy or non-R-like conventions being adopted just for the sake of parity with the Python lessons.

Also, and I thought this was the case, you can use min() and max() on a data.frame, just not mean() which seems bone-headedly stupid:

> df <- data.frame(A = 1:10, B = rnorm(10))
> min(df)
[1] -1.768958
> max(df)
[1] 10
> mean(df)
[1] NA
Warning message:
In mean.default(df) : argument is not numeric or logical: returning NA

I would suggest mean(as.matrix(df)) could be a useful compromise until such a time as R becomes less stoopid in this regard?

Better would be to do:

mat <- as.matrix(dat)
mean(mat)
sd(mat)

I'm not sure you can argue that doing heatmap(as.matrix(dat)) is inconvenient as you chose to add this plot type in your revisions ;-)


We are studying inflammation in patients who have been given a new treatment for arthritis,
and need to analyze the first dozen data sets.
The data sets are stored in `.csv` each row holds information for a single patient, and the columns represent successive days.
The data sets are stored in [comma-separated values](../../gloss.html#comma-separeted-values) (CSV) format: each row holds information for a single patient, and the columns represent successive days.
The first few rows of our first file look like this:

0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
Contributor

If these files ever change, the document would be inconsistent with the new data. I wonder if we could use readLines("foo", n = 6) in an R code chunk to read the actual data?
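A sketch of that suggestion, with a throwaway file standing in for the real data file:

```r
# stand-in for the lesson's CSV file
tmp <- tempfile(fileext = ".csv")
writeLines(c("0,0,1,3", "0,1,2,1", "0,1,1,3"), tmp)
# embed the first lines of the actual file instead of pasting them by hand
readLines(tmp, n = 2)
# [1] "0,0,1,3" "0,1,2,1"
```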

@gavinsimpson
Contributor

I'm going to have to pull the changes locally as the diff in github in my Chrome is losing spaces and possibly other characters. If some of the line-notes above are due to this, please accept my apologies and ignore them.

@gavinsimpson
Contributor

General Comments:

  1. There is inconsistency in the way functions are referred to; with or without parentheses: read.csv vs read.csv(). I'm pro read.csv() as it helps to reinforce a distinction between functions and objects (which we call variables here).
  2. I'm not sold on the heatmap idea. If you aren't used to these and the concept of hot colours, you need to have your brain dialed in to the fact that red is cold, yellow hotter, and white scorching. Would a matplot(dat) not be more generally applicable here (perhaps with type = "l" to get lines, one per patient)? I certainly don't see the need to have two heatmaps in here; if we're set on retaining them, perhaps keep just the second one?
  3. Forcing the conversion from a data frame to a matrix is something I'm not keen on, as mentioned above. In my opinion it is better to work around the infelicity in the case of mean() on a data frame than force working with a matrix.

Minor comments:

  1. I really don't like the terminology used in some places. "parameters" for "arguments" and "variables" for "objects". Which is more confusing for novice students; being exposed to a slightly odd term from programming parlance or learning something more familiar ("parameters") which they have to unlearn when they delve deeper and start making use of online materials from outside the SWC sphere? Likewise "variables", which is confusing with the same term in statistics, which is predominantly why people use R even if this is not the motivation behind the SWC lessons or bootcamps.
  2. Where you explain apply() and the MARGIN argument, it would be good to link the use of MARGIN = 1 in the text with the actual call. Hence use apply(dat, MARGIN = 1, mean), say. That makes it clear what the 1 or 2 are. Then say that commonly the name is dropped, hence apply(dat, 1, mean). Might help to reinforce what this is and why you called it MARGIN earlier.
  3. We need to be careful to not propagate the behaviour of having lengthy variable/object names. There is a compromise between being terse or verbose and clarity of code. avg_patient_inflammation for example exceeds acceptable limits for me - this shouldn't be an exercise in typing.

Overall though @jdblischak this is looking really nice - especially the additions of the images and the links to other SWC content for the terminology.

@jdblischak
Contributor Author

Thanks to everyone for the recent reviews. Since @gavinsimpson nicely summarized many of the main points we have been discussing, I'll respond to his comments below:

There is inconsistency in the way functions are referred to; with or without parentheses: read.csv vs read.csv(). I'm pro read.csv() as it helps to reinforce a distinction between functions and objects (which we call variables here).

I removed the parentheses from the inline references to functions because the Python lessons do not use them. I have no strong opinion on the matter, but for the sake of consistency I think such a change should be proposed more widely (and voted on at a lab meeting if necessary).

I'm not sold on the heatmap idea. If you aren't used to these and the concept of hot colours, you need to have your brain dialed in to the fact that red is cold, yellow hotter, white, scorching. Would a matplot(dat) not be more generally applicable here (perhaps with type = "l" to get lines, one per patient)? I certainly don't see the need to have two heatmaps in here; if we're set on retaining them, perhaps keep just the second one?

The heatmaps are there for foreshadowing. In lesson 04, rblocks are used to create a heatmap. Instead of running heatmap and suppressing the dendrograms, we could alternatively run image on the transposed (t) data.
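That alternative might look like this (toy matrix in place of the inflammation data; the orientation remark is my reading of image()'s behaviour, not from the lesson):

```r
dat <- matrix(seq_len(12), nrow = 3)
# image() draws a heat-style plot with no dendrograms; t() compensates for
# image() mapping rows of its input to the x axis
image(t(dat))
# versus suppressing heatmap's dendrograms explicitly:
heatmap(dat, Rowv = NA, Colv = NA)
```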

Forcing the conversion from a data frame to a matrix is something I'm not keen on, as mentioned above. In my opinion it is better to work around the infelicity in the case of mean() on a data frame than force working with a matrix.

users are going to encounter data frames much more frequently than matrices.

I agree that they will be more likely to encounter data frames, but all the functions that are performed on the matrix in this lesson work equally well on a data frame. But I have now been out-voted, so I will switch back to a data frame. Which of these options would everyone prefer?

mean(as.matrix(dat))
# or
mat <- as.matrix(dat)
mean(mat)
# or something else

I really don't like the terminology used in some places. "parameters" for "arguments" and "variables" for "objects". Which is more confusing for novice students; being exposed to a slightly odd term from programming parlance or learning something more familiar ("parameters") which they have to unlearn when they delve deeper and start making use of online materials from outside the SWC sphere? Likewise "variables", which is confusing with the same term in statistics, which is predominantly why people use R even if this is not the motivation behind the SWC lessons or bootcamps.

Adding the "parameters" terminology was just another attempt at consistency. However, we do have the SWC blessing to use "arguments" instead of "parameters" (lab meeting summary and Issue #511).

I am not aware of any SWC discussion on "variable" vs "object", nor could I find any on bc or the discuss list. Anecdotally, when I first started learning programming I used the term variable because my mental model was that it was like a variable in algebra. However, I think the main SWC discuss list would be the place to have this conversation since it is not specific to R.

Where you explain apply() and the MARGIN argument, it would be good to link the use of MARGIN = 1 in the text with the actual call. Hence use apply(dat, MARGIN = 1, mean), say. That makes it clear what the 1 or 2 mean. Then say that commonly the name is dropped, hence apply(dat, 2, mean). Might help to reinforce what this is and why you called it MARGIN earlier.

I struggled with this decision because the discussion of naming params/args is in the next lesson. If we name MARGIN, then I think we would need to name all of them. apply(X = dat, MARGIN = 1, FUN = mean) just looks so bad. I'll do it this verbose way and then add a comment as you recommended.

We need to be careful to not propagate the behaviour of having lengthy variable/object names. There is a compromise between being terse or verbose and clarity of code. avg_patient_inflammation for example exceeds acceptable limits for me - this shouldn't be an exercise in typing.

I'm open to suggestions.

Overall though @jdblischak this is looking really nice - especially the additions of the images and the links to other SWC content for the terminology.

Thanks!

@gavinsimpson
Contributor

I removed the parentheses from the inline references to functions because the Python lessons do not use them. I have no strong opinion on the matter, but for the sake of consistency I think such a change should be proposed more widely (and voted on at a lab meeting if necessary).

Sounds good to me. I'm happy to go with the consensus as it is only a personal preference to differentiate objects from functions in the narrative.

The heatmaps are there for foreshadowing. In lesson 04, rblocks are used to create a heatmap. Instead of running heatmap and suppressing the dendograms, we could alternatively run image on the transposed (t) data.

I hadn't realised that (I just had a look). While I see your point about foreshadowing, it seems to me that introducing something complicated like a heatmap with dendrograms etc. in lesson one doesn't gain us much when you could just explain what a heat map is in lesson 4. At this stage (lesson 1) we're really just looking at the data over time, and matplot(dat, type = "l") (possibly matplot(t(dat), type = "l")?) gives a reasonable overview of those data, coded by patient. I wonder what other people think?
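For concreteness, the matplot() alternative being suggested looks roughly like this; the data frame below is a hypothetical stand-in for the lesson's inflammation data.

```r
# Stand-in for the 60-patient x 40-day inflammation data frame
set.seed(1)
dat <- as.data.frame(matrix(rpois(60 * 40, lambda = 5), nrow = 60))

# One line per patient over time: transpose so each column of t(dat)
# is a patient, and matplot() draws one line per column
matplot(t(dat), type = "l", xlab = "Day", ylab = "Inflammation")
```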

But I have now been out-voted, so I will switch back to a data frame. Which of these options would everyone prefer?

As you need to do sd(dat) too, I'd prefer

mat <- as.matrix(dat)
mean(mat)
sd(mat)

Good news re parameters vs arguments. Do you want to raise the objects vs variables part on SWC discuss?

I'll do it this verbose way and then add a comment as you recommended

I don't think you have to go this far. Perhaps do it the other way round; use the simple versions without named arguments but add a note that this is as if you'd called apply(dat, MARGIN = 1, mean) just to reinforce where the 1 comes from? Would that be a more acceptable compromise?
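The compromise being proposed, i.e. using the terse call in the lesson and noting its fully named equivalent, can be sketched as below (with hypothetical stand-in data for `dat`).

```r
# Stand-in data: 10 patients x 4 days
set.seed(1)
dat <- as.data.frame(matrix(rpois(10 * 4, lambda = 5), nrow = 10))

avg_shorthand <- apply(dat, 1, mean)                 # per-patient mean, terse form
avg_named     <- apply(dat, MARGIN = 1, FUN = mean)  # the same call, arguments named
identical(avg_shorthand, avg_named)  # TRUE
```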

* Select individual values and subsections from data.
* Perform operations on arrays of data.
* Perform operations on a data frame of data.

do you also want to mention matrix here?

the actual mean, median, and sd calls are on a matrix

but they're converted from a data frame, so I'm not sure...


We could just use "data set" and be done with it - the students don't need to worry about us doing some things on a data frame and some things on a matrix in these bullets.


👍

@jdblischak
Contributor Author

I think I addressed most of the main concerns in this latest round of commits. Please let me know if there is anything else you would like me to address. Also, please vote +1 when you approve this PR for merging.

Gavin wrote:

Good news re parameters vs arguments. Do you want to raise the objects vs variables part on SWC discuss?

The variable v. object distinction does not bother me that much, so I'll let you make the case on the discuss list. Either way, it would be an easy enough change to make in a future PR if we decide to change the terminology, so I'd prefer if it did not inhibit this PR from being merged.

So one last change I would like to propose. I find the use of both a data frame and a matrix to be quite awkward, and it repeatedly comes up in future lessons. To resolve this in lesson 02 in PR #675, I simply changed all instances of taking the mean or standard deviation of an entire matrix to just a vector. Basically, I pass in one column (i.e. one day of inflammation data) instead of the entire matrix. I much prefer this approach. If others agree, I'll make similar changes to this lesson.
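The proposed change, i.e. replacing operations on a whole converted matrix with operations on a single column (one day of inflammation data, which is an ordinary numeric vector), can be sketched as follows with hypothetical stand-in data.

```r
# Stand-in for the 60-patient x 40-day inflammation data frame
set.seed(1)
dat <- as.data.frame(matrix(rpois(60 * 40, lambda = 5), nrow = 60))

# Instead of converting the whole data frame to a matrix...
mean(as.matrix(dat))

# ...operate on a single column, which is a plain numeric vector:
mean(dat[, 1])  # mean inflammation on day 1 across all patients
sd(dat[, 1])    # and its standard deviation
```

This sidesteps the data-frame/matrix conversion entirely, since extracting a column already yields a vector that mean() and sd() handle directly.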

@jdblischak
Contributor Author

I had written:

So one last change I would like to propose. I find the use of both a data frame and a matrix to be quite awkward, and it repeatedly comes up in future lessons. To resolve this in lesson 02 in PR #675, I simply changed all instances of taking the mean or standard deviation of an entire matrix to just a vector. Basically, I pass in one column (i.e. one day of inflammation data) instead of the entire matrix. I much prefer this approach. If others agree, I'll make similar changes to this lesson.

I went ahead and implemented this. Removing the duplicate set of data stored as a matrix by instead performing mathematical operations on the columns of the data frame makes this lesson and the following ones go much smoother.

We've collectively put a lot of work into this PR. Could I please get some +1's for merging or let me know if there are additional points you would like to see addressed? @cboettig @gavinsimpson @chendaniely @jainsley Thanks!

@gvwilson
Contributor

+1 from me - very pleased by the way you've pulled this together.

@chendaniely
Contributor

Removing the duplicate set of data stored as a matrix by instead performing mathematical operations on the columns of the data frame makes this lesson and the following ones go much smoother.

That sounds good. I haven't looked at lesson 2 yet, but do we eventually use the as.matrix or as.dataframe or whatever?

@jdblischak
Contributor Author

do we eventually use the as.matrix or as.dataframe or whatever?

I avoided it in lesson 02. I know for sure it will come up in lesson 04 to make a heatmap. Still working on lesson 03, so I still don't know for that one.

@chendaniely
Contributor

👍

@gavinsimpson
Contributor

Hi @jdblischak,

I also favour ditching the rather odd mathematical operation over what are essentially panel data. The current data structure is not an ideal one for this anyway.

So +1 from me too.

@jdblischak
Contributor Author

OK, this appears well supported. @gvwilson, please merge this when you get the chance.

Thanks to everyone for the helpful reviews!

gvwilson pushed a commit that referenced this pull request Sep 2, 2014
Edits to make novice R lesson 01 more similar to novice Python lesson 01.
@gvwilson gvwilson merged commit 17e94c1 into swcarpentry:master Sep 2, 2014
@jdblischak jdblischak deleted the 01-starting-with-data branch September 4, 2014 02:32
@jdblischak jdblischak restored the 01-starting-with-data branch September 8, 2014 17:04