Edits to make novice R lesson 01 more similar to novice Python lesson 01. #639
Conversation
* Can also create with `data.frame()` function.
* Find the number of rows and columns with `nrow(dat)` and `ncol(dat)`, respectively.
* Rownames are usually 1, 2, ..., n.
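A minimal illustration of the bullets above, using a made-up two-column data frame rather than the lesson's inflammation data:

```r
# Toy data frame (not the lesson's data) illustrating the bullets above
dat <- data.frame(x = 1:3, y = 4:6)
nrow(dat)      # number of rows
ncol(dat)      # number of columns
rownames(dat)  # default row names: "1" "2" "3"
```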
<pre class='in'><code>dat <- as.matrix(dat)
Curiously, the Github md renderer is hiding the `dat <- as.matrix` line in the .md preview of this line, so I thought something was wrong. Looks okay here though -- must be due to the presence of the HTML tags. Not obvious why your .md output is not using md code blocks (like `~~~`) here though?
This is so that the formatting of the R lessons is identical to that of the Python lessons. I accomplished this using knitr hooks as you suggested during the sprint (see this comment).
Cool, though an extra newline after the `<code>` and before `x` would probably fix the Github rendering (unless that's for consistency sake too).
Hmm, can you link me to an example python though? I missed the part about switching into using `<pre>` html tags in the output. It seems non-ideal to me to have this chimeric markdown-html mix for code blocks (even though I suppose the .md is just an intermediate form at this point that no one cares about). Using `~~~r` you have a syntax that just about every markdown parser knows about (your Jekyll flavor: kramdown, pandoc, and Github flavor, anyhow), which should result in nice pretty syntax highlighting on Jekyll presuming you've set it up. Using explicit html "pre" blocks you have a far more ambiguous format that doesn't seem to indicate the language used.
> Cool, though an extra newline after the `<code>` and before `x` would probably fix the Github rendering (unless that's for consistency sake too).

The extra new line creates a blank line in all the code blocks.
The Python lessons are written in IPython Notebooks and then converted to Markdown using a custom script. All the novice Python lessons are in the bc repo in `novice/python`. And you can view the final rendered version from the SWC website (Python lessons).
You are correct that using the "pre" blocks loses the syntax highlighting. For background context on how we ended up with the current system for building the lessons, you should start at Issue #349. If you have ideas on how the entire build process could be improved, I'd recommend sending an email to the discuss mailing list.
@jdblischak Mostly looks good to me, though as I'm not familiar with the intro python lessons you might want to ignore my comments...
Overall very nice tutorial though! Hope I didn't get too bogged down in the subjective and the minutiae.
Thank you very much for the thorough review, @cboettig. All your suggestions are extremely reasonable. Most of the points you raise, e.g. matrix versus data frame, were choices I made to keep the lesson in line with the novice Python lesson. To give you some context, back in March all the SWC instructors interested in developing R lessons met to discuss our strategy. You can read the full summary, but the tl;dr version is that we chose to mirror the Python lessons instead of creating new R-specific lessons. This obviously has its pros and cons. The main pro is that it keeps the focus of the lessons on good programming practices in general, which can also be seen as its greatest con since it largely ignores wonderful tools like those in the Hadleyverse. So the goal of this set of lessons is not so much "Learn how to use R most effectively", it is "Teach novices how to program a computer to analyze data (using R)." The hope is that these best practices, e.g. modularity, don't repeat yourself, defensive programming, etc., will translate no matter what language they end up using.
@jdblischak that makes perfect sense, thanks for the context. Of the options you list, I agree that translating lessons that you know work makes more sense than the alternatives, as does emphasizing such language-agnostic practices. I think there is still an open question as to what is considered a translation though. A matrix is the 'base-class' of matlab, but I would argue that it is not the 'base class' of R, and that the translation would be more complete by introducing the language's workhorse data structure rather than the one that best corresponds to the data structure in python. (After all, it's because you are most concerned with language-agnostic concepts that you shouldn't care if the translation is the most bitwise equivalent to the python lesson, but rather that it is the best expression of those bigger concepts in whatever language it is expressed. To go to the logical extreme, no one would suggest that calling python from R was the best 'translation' into R, though it is clearly the most consistent.) I'm not sure that I would suggest using the Hadleyverse tools either, but that is probably more a question of course content than of language. Teaching that data headings matter, that code should be self-documenting rather than referring to numeric column indices, etc., may just not be part of the syllabus at this stage, so I see why you would ignore it. Nevertheless, I don't think using either such practices or explicit tools from the Hadleyverse is inconsistent with the lesson goals. I'm not experienced enough to know which vocabulary (e.g. base R or Hadley's) is most effective at teaching students those bigger goals. Perhaps it makes no difference.
This is by design. We have found that listing all the data types and data structures does not work well with complete novices. They have no context for why these details are important. This is why these novice lessons were originally designed (in Python) to be data-driven. We want to demonstrate why writing code to automate your analysis is such a better way to do science than using spreadsheet software. Observing how to perform a data analysis will give them the context to better understand data types and structures when it comes time to learn more advanced material.
Converting the data to a matrix allows us to perform
But the example used is numeric data. Thus I think it does make sense to use matrix. The only reason it was a data frame is because that is what is returned by
Good point. I'll think about a way to discuss this more in depth after introducing
The data is supposed to be 60 patients as the rows and 40 days as the columns. When discussing
Totally agree. This is for the sake of consistency. Visualizing data as a heatmap returns in a later lesson, so I couldn't remove this.
I agree that using tools from the Hadleyverse does enhance the R experience. For now our plan is to use mainly base R in the main set of lessons, and then have supplementary lessons on useful R libraries, e.g. see PR #621 on ggplot2.
Thanks! This lesson has been the result of the cumulative effort of multiple dedicated SWC instructors. We appreciate the feedback.
Thanks for the detailed replies. You clearly have a well developed and well tested lesson here, and I'm glad

Just for the sake of clarity, I do not think being numeric justifies the

Right, I realize the data in the example doesn't have variables as columns,

On Fri, Aug 1, 2014 at 2:41 PM, John Blischak notifications@github.com

Carl Boettiger
OK. I added a few more notes on how to call a function. There are more details on default parameters and when you can or cannot pass an object without naming it in the next lesson, so I didn't want to delve into too much detail.
I understand your concerns, but I don't see how I can appease them without creating a whole new lesson. Another way to represent the data is a data frame with the following columns:
But then the sections on slicing rows and columns, calculating summary statistics along the rows or columns, plotting a heatmap, etc. would all have to be completely different. And if we leave the data in the same dimensions but as a data frame, then the calls to
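The column list in the comment above is lost in this transcript; purely as a hypothetical sketch (column names invented for illustration), a "one row per observation" representation of the same data might look like this:

```r
# Hypothetical long-format layout: one row per patient-day measurement
# instead of a 60 x 40 table (all names and values invented)
long <- data.frame(
  patient      = c(1, 1, 2, 2),
  day          = c(1, 2, 1, 2),
  inflammation = c(0, 3, 1, 4)
)

# Per-day summaries would then use subsetting rather than column slicing
mean(long$inflammation[long$day == 1])
```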
Okay, sounds good to me.

Carl Boettiger
sent from mobile device; my apologies for any terseness or typos
@sarahsupp, @dbarneche, @gavinsimpson, @naupaka, and @jainsley. I'd be interested in your feedback.
```{r}
read.csv("data/inflammation-01.csv", header = FALSE)
```{r, results="hide"}
read.table(file = "inflammation-01.csv", sep = ",")
Why did you change `read.csv()` to `read.table()`? They both return a data frame, but it seems to me that `read.csv` is less of a mental leap since they are already working with csv files. For other delimited data, hadley recommends `read.delim` here.
I also used `read.csv()` in the reference sheet (#663), since it is in the current material.
The three options are:

```r
read.csv(file = "inflammation-01.csv", header = FALSE)
read.table(file = "inflammation-01.csv", sep = ",")
read.delim(file = "inflammation-01.csv", header = FALSE, sep = ",")
```

The use of either `read.table` or `read.csv` requires specifying one parameter in addition to the filename. Since the file does not have a header, I think it makes more sense to focus on the delimiter than the non-existent header. This also avoids having to talk about Boolean values right at the beginning.
> This also avoids having to talk about Boolean values right at the beginning.

Makes sense.

Would you plan to later on show the `read.table()` documentation and show them that there are wrappers for common file formats?
I disagree @jdblischak. The file is a CSV file and `read.csv()` is the natural choice here. I think one can explain `header = FALSE` with something along the lines of "there are no column labels, or header, hence we use... We'll explain more about `FALSE` and `TRUE` later."?
I agree with the use of `read.csv()` in this case. In my opinion, understanding `TRUE` and `FALSE` is much more intuitive than understanding `read.csv()` vs. `read.delim()` vs. `read.table()`. It can be explained during teaching that we have a CSV file and R has a great built-in function for importing CSV files.

But, this is all really nitpicky stuff. Overall, I think the lesson looks great.
Yup, I agree, but the point of reviews is to comment on the broad things as well as the minor things. @jdblischak made some great changes but also chose to alter something which I thought worked just fine before.
I'm going to push back on the use of a matrix here. Seems to me that focussing on that because we can run

This is one of those tensions between Python and R, and following the Python lessons too closely potentially at least makes for clumsy or non-R-like conventions being followed just for the sake of parity with the Python lessons.

Also, and I thought this was the case, you can use

```r
> df <- data.frame(A = 1:10, B = rnorm(10))
> min(df)
[1] -1.768958
> max(df)
[1] 10
> mean(df)
[1] NA
Warning message:
In mean.default(df) : argument is not numeric or logical: returning NA
```

I would suggest

Better would be to do:

```r
mat <- as.matrix(dat)
mean(mat)
sd(mat)
```

I'm not sure you can argue that doing
We are studying inflammation in patients who have been given a new treatment for arthritis,
and need to analyze the first dozen data sets.
The data sets are stored in `.csv` each row holds information for a single patient, and the columns represent successive days.
The data sets are stored in [comma-separated values](../../gloss.html#comma-separeted-values) (CSV) format: each row holds information for a single patient, and the columns represent successive days.
The first few rows of our first file look like this:
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
If these files ever change, the document would be inconsistent with the new data. I wonder if we could use `readLines("foo", n = 6)` in an R code chunk to read the actual data?
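The suggestion above could look roughly like this; a temp file stands in for `data/inflammation-01.csv` here, since the real file isn't available in this sketch:

```r
# Sketch of the suggestion: preview the first lines of the actual data file
# from a code chunk, so the document can never drift out of sync with the data.
# A throwaway temp file stands in for data/inflammation-01.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c("0,0,1,3,1,2", "0,1,2,1,2,1", "0,1,1,3,3,2"), tmp)

cat(readLines(tmp, n = 2), sep = "\n")
```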
I'm going to have to pull the changes locally as the diff in github in my Chrome is losing spaces and possibly other characters. If some of the line-notes above are due to this, please accept my apologies and ignore them.
General Comments:
Minor comments:
Overall though @jdblischak this is looking really nice - especially the additions of the images and the links to other SWC content for the terminology.
Thanks to everyone for the recent reviews. Since @gavinsimpson nicely summarized many of the main points we have been discussing, I'll respond to his comments below:
I removed the parentheses from the inline references to functions because the Python lessons do not use them. I have no strong opinion on the matter, but for the sake of consistency I think such a change should be proposed more widely (and voted on at a lab meeting if necessary).
The heatmaps are there for foreshadowing. In lesson 04, rblocks are used to create a heatmap. Instead of running
I agree that they will be more likely to encounter data frames, but all the functions that are performed on the matrix in this lesson work equally well on a data frame. But I have now been out-voted, so I will switch back to a data frame. Which of these options would everyone prefer?
Adding the "parameters" terminology was just another attempt at consistency. However, we do have the SWC blessing to use "arguments" instead of "parameters" (lab meeting summary and Issue #511). I am not aware of any SWC discussion on "variable" vs "object", nor could I find any on bc or the discuss list. Anecdotally, when I first started learning programming I used the term variable because my mental model was that it was like a variable in algebra. However, I think the main SWC discuss list would be the place to have this conversation since it is not specific to R.
I struggled with this decision because the discussion of naming params/args is in the next lesson. If we name MARGIN, then I think we would need to name all of them.
I'm open to suggestions.
Thanks!
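The claim above that the matrix operations in the lesson work equally well on a data frame can be spot-checked with toy data: `apply()` coerces a data frame via `as.matrix()` internally, so per-row (or per-column) statistics agree between the two representations.

```r
# Toy data (not the lesson's): same row means from data frame and matrix
dat <- data.frame(a = c(1, 2), b = c(3, 4))
mat <- as.matrix(dat)

row_means_df  <- apply(dat, 1, mean)  # apply() coerces dat with as.matrix()
row_means_mat <- apply(mat, 1, mean)

identical(row_means_df, row_means_mat)
```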
Sounds good to me. I'm happy to go with the consensus as it is only a personal preference to differentiate objects from functions in the narrative.
I hadn't realised that (I just had a look). Whilst I see your point of foreshadowing, seems to me introducing something complicated like a heatmap with dendrograms etc in lesson one doesn't gain us much when you could just explain what a heat map is in lesson 4? At this stage (lesson 1) we're really just looking at the data over time and the
As you need to do

```r
mat <- as.matrix(dat)
mean(mat)
sd(mat)
```

Good news re parameters vs arguments. Do you want to raise the objects vs variables part on SWC discuss?
I don't think you have to go this far. Perhaps do it the other way round; use the simple versions without named arguments but add a note that this is as if you'd called
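A sketch of the kind of note being proposed, with toy data (the lesson's actual call may differ):

```r
# Toy data frame; the note would say: this positional call...
dat <- data.frame(a = c(1, 2), b = c(3, 4))
avg_col <- apply(dat, 2, mean)

# ...is as if you'd called, with every argument named:
avg_col_named <- apply(X = dat, MARGIN = 2, FUN = mean)

identical(avg_col, avg_col_named)
```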
…converting data frame. Also removed heatmaps, which can be introduced later.
* Select individual values and subsections from data.
* Perform operations on arrays of data.
* Perform operations on a data frame of data.
Do you also want to mention matrix here? The actual mean, median, and sd calls are on a matrix, but they're converted from a data frame, so I'm not sure...
We could just use "data set" and be done with it - the students don't need to worry about us doing some things on a data frame and some things on a matrix in these bullets.
👍
I think I addressed most of the main concerns in this latest round of commits. Please let me know if there is anything else you would like me to address. Also, please vote +1 when you approve this PR for merging.

Gavin wrote:
The variable v. object distinction does not bother me that much, so I'll let you make the case on the discuss list. Either way, it would be an easy enough change to make in a future PR if we decide to change the terminology, so I'd prefer if it did not inhibit this PR from being merged.

One last change I would like to propose: I find the use of both a data frame and a matrix to be quite awkward, and it repeatedly comes up in future lessons. To resolve this in lesson 02 in PR #675, I simply changed all instances of taking the mean or standard deviation of an entire matrix to just a vector. Basically, I pass in one column (i.e. one day of inflammation data) instead of the entire matrix. I much prefer this approach. If others agree, I'll make similar changes to this lesson.
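The column-based approach described above might look like this, with made-up numbers standing in for the inflammation data frame:

```r
# Made-up stand-in for the inflammation data frame (patients x days)
dat <- data.frame(day1 = c(0, 1, 2), day2 = c(3, 4, 5))

# Summarize a single day's column (a plain vector) instead of a whole matrix
mean(dat[, 2])  # mean inflammation across patients on day 2
sd(dat[, 2])    # its standard deviation
```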
I had written:
I went ahead and implemented this. Removing the duplicate set of data stored as a matrix by instead performing mathematical operations on the columns of the data frame makes this lesson and the following ones go much smoother.

We've collectively put a lot of work into this PR. Could I please get some +1's for merging, or let me know if there are additional points you would like to see addressed? @cboettig @gavinsimpson @chendaniely @jainsley Thanks!
+1 from me - very pleased by the way you've pulled this together.

That sounds good. I haven't looked at lesson 2 yet, but do we eventually use the

I avoided it in lesson 02. I know for sure it will come up in lesson 04 to make a heatmap. Still working on lesson 03, so I still don't know for that one.

👍

Hi @jdblischak, I also favour ditching the rather odd mathematical operation over what are essentially panel data. The current data structure is not an ideal one for this anyway. So +1 from me too.

OK, this appears well supported. @gvwilson, please merge this when you get the chance. Thanks to everyone for the helpful reviews!
I have made edits so that the first R lesson more closely follows the Python lesson. It covers all the same topics save two:

* `par` options, which is not a beginner topic.

Here's how you can quickly compare the two rendered files:

Then compare the rendered files `_site/novice/r/01-starting-with-data.html` and `_site/novice/python/01-numpy.html`.