This repository has been archived by the owner on Jan 3, 2018. It is now read-only.

R motivational slides #614

Closed
wants to merge 12 commits into from


sritchie73
Contributor

Created some motivational slides for why you should learn and use R.
Issues that need solving:

  • A font with better character spacing needs to be selected. Titles with "R" in them are unclear
  • Additional images would be nice for visual support
  • Logos for R and RStudio would be useful for the title slide.

@rgaiacs

rgaiacs commented Jul 23, 2014

What about mentioning mixing R with LaTeX or Markdown?

@jdblischak jdblischak added the R label Jul 23, 2014
@jdblischak
Contributor

+1 to @r-gaia-cs 's suggestion to highlight the ability to create nice, well-documented reports from your analyses.

These slides do a good job on the long-term motivations for using R: no matter how complicated a problem you encounter in the future, you will likely not have to start from scratch, and you can also extend R yourself relatively easily. But what about the short term for those completely new to programming? I would argue that if you have a tabular data set, R is the quickest option to start learning about your data as you learn the language. Consider this small example:

my_dat <- read.table("data.txt")
summary(my_dat)
boxplot(continuous_var ~ categorical_var, data = my_dat)

This is so empowering because you can get to this level after only a little bit of learning R. Most other languages have a much steeper learning curve. And even in Python, you would first have to import pandas and matplotlib and be familiar with calling methods with dot notation before accomplishing something similar.

@benmarwick
Contributor

This looks wonderful, concise and covers all the main selling points. I have two suggestions for additions, perhaps just two bullet points somewhere on the slides:

First, I'd also add somewhere that R has a very large community of users who are generally very helpful, and mention some of the bigger and better sources of free online information like http://stackoverflow.com/questions/tagged/r, http://www.statmethods.net/, http://www.twotorials.com/ and so on. Package authors are also generally willing to help users with their packages. This is a very important detail because people who have grown up with a commercial stats package will be anxious about switching to R and not having a help line that they are entitled to call because of their licensing fees. They will want to know that help is available for R, but it's not in the form that they might be used to!

Second, I'd make a brief mention of how R improves the reproducibility and transparency of research. People using point-and-click stats packages are typically not very aware of this issue (because of how hard it is to do reproducible research with a point-and-click interface), and as a script-driven environment R gives it to them for free. Using R, a researcher can script an analysis that they can run over and over with different data and different projects, and give to someone else (e.g. students) to use and verify. They can also publish their code online for others to inspect and validate their analyses, and so on. Since the target audience here is mostly non-programmers, this benefit of openness that we get from a stats package based on scripting is likely to make quite an impression. This might be a good topic to connect to LaTeX and Markdown as @jdblischak and @r-gaia-cs suggest.

@sritchie73
Contributor Author

All worth mentioning; I didn't think of adding those points. I was trying to
identify things that R has that Python doesn't.

@sritchie73
Contributor Author

A couple of counter arguments:

  • R is fantastic for a novice when your data is well formed, but how
    often is that really the case? I find cleaning data can often be easier
    in bash or Python.
  • There is also obtuse legacy behaviour like stringsAsFactors=TRUE.
  • The boxplot command requires knowledge of ~, which is just as advanced
    as dot notation.

But I agree with you, definitely the easiest and fastest language for
plotting otherwise!

@gvwilson
Contributor

It's fine to have arguments pro and con in motivational slides - if
you're the first person to point out something's limits, it makes the
rest of what you say more believable.

@cboettig

Looks great. Listing some cons is wise -- I'd just say the syntax is more challenging/frustrating than most (so that users don't get too discouraged when they struggle with the use of ~ or y <- x[["a"]]$b.v[1]).
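
As a minimal sketch of how an expression like that reads, assuming a hypothetical object `x` (a list holding a data frame with a column named `b.v`; these names are invented for illustration and come from no real dataset), each operator can be applied one step at a time:

```r
# Hypothetical object, invented for illustration: a list whose element "a"
# is a data frame with a single column named "b.v".
x <- list(a = data.frame(b.v = c(10, 20, 30)))

# The one-liner from the comment above
y <- x[["a"]]$b.v[1]

# The same access, one operator at a time
elem  <- x[["a"]]   # [[ ]] extracts a single element from the list
col   <- elem$b.v   # $ extracts a column by name; the dot is part of the name
first <- col[1]     # [ ] subsets the vector, here its first value

# Both routes give the same value
stopifnot(identical(y, first))
```

Breaking the chain apart like this is one way to show novices that the line is three ordinary subsetting steps rather than special syntax.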

@sritchie73 I'm surprised that you find data cleaning harder in R; I would have listed that as one of its greatest strengths! Have you had a read through http://vita.had.co.nz/papers/tidy-data.html or the more recent http://blog.rstudio.org/2014/07/22/introducing-tidyr/ ?

@gavinsimpson
Contributor

@sritchie73 @gvwilson It is important to be neutral in providing pros and cons and some of those cons are very personal. stringsAsFactors = TRUE is, as far as I am concerned, just great; it stops people doing silly things with categorical data when fitting statistical models and let's not forget that is why R exists in the first place. It's also easy to work around once you are aware of the issue.

boxplot() doesn't require use of a formula; the formula is a more user-friendly way to plot multiple boxes, but you can also call the function with a matrix/data frame. Also, if one introduces plot(y ~ x, data = foo) rather than plot(foo$x, foo$y) early on, users soon learn the utility of formulas in R, how to write them and what they mean.
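
To make the two calling styles concrete, here is a small sketch using the built-in ToothGrowth data (the dataset is an assumption chosen for illustration, not something taken from the slides):

```r
# Built-in dataset used purely for illustration:
# len (numeric), supp (factor with 2 levels), dose (numeric)
dat <- ToothGrowth

# Formula interface: one box per level of supp
boxplot(len ~ supp, data = dat)

# No formula: pass a list with one element per box
boxplot(split(dat$len, dat$supp))

# The two scatterplot calls draw the same points; note that x comes
# first in the non-formula form, while the formula reads "y by x"
plot(len ~ dose, data = dat)
plot(dat$dose, dat$len)

# Both boxplot routes compute the same summary statistics
bp_formula <- boxplot(len ~ supp, data = dat, plot = FALSE)
bp_list    <- boxplot(split(dat$len, dat$supp), plot = FALSE)
stopifnot(isTRUE(all.equal(bp_formula$stats, bp_list$stats)))
```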

For cleaning data, I rarely use anything else but R for this. Probably that's because I know a lot more R than bash or python or some other such language.

@gavinsimpson
Contributor

Rather than just tell people why R is so great, why not show them an excellent example?

One that is used quite often for a more statistically-minded group is creating a bootstrap confidence interval on the kernel density estimate of the Old Faithful Waiting Time data (faithful$waiting), using replicate() to do the bootstrapping. A handful of lines of code gets you the KDE, a bootstrap confidence interval and a plot with very little effort. Doing this in another language would require a lot more coding effort. The point though is not to poke fun at the other languages but to highlight that R as a language is designed for rapid, interactive data analysis.

Code

kde <- with(faithful, density(waiting))
from <- min(kde$x)
to <- max(kde$x)
boots <- with(faithful, replicate(10000, {
    samp <- sample(waiting, replace = TRUE)
    density(samp, from = from, to = to)$y
}))
ci <- apply(boots, 1, quantile, probs = c(0.025, 0.975))
plot(kde, ylim = range(ci))
polygon(c(kde$x, rev(kde$x)),
        c(ci[1, ], rev(ci[2, ])), col = "grey", border = FALSE)
lines(kde, lwd = 2)

[Image: faithful-boot]

@naupaka
Member

naupaka commented Jul 23, 2014

Another +1 to adding a mention of knitr/rmarkdown to create nice reports. People are always impressed when I show them how easy it is to make a nice deliverable (html or pdf) to share with their collaborators and/or PI. I would also mention what a great resource RStudio is for coding in R, particularly for novices: built-in help, objects browser, tab completion(!!), not to mention more advanced things like git integration. This makes R a lot more familiar for people coming from something like MATLAB.

I think it may also be worth mentioning briefly that different disciplines tend to have different 'default/go-to' languages. In ecology, for example, R is certainly the 'go-to', which means a lot of code from manuscripts, or for new analysis methods, is in R. I think other disciplines have other defaults, e.g. Python or MATLAB.

@dhaine
Contributor

dhaine commented Jul 23, 2014

Regarding data cleaning, an R novice might be a novice in other languages too. So I don't think it will help him/her to refer to other languages (depends on the audience). Also, as a novice, you might prefer to do as much as you can in a single environment (i.e. all in R).

@benmarwick
Contributor

@dhaine agree completely with both of those points. Seems a more consistent approach with the novice student as someone coming to the command line for the first time.

@sritchie73
Contributor Author

Great to see a lot of discussion in the second day of the bootcamp!

@cboettig agreed on the syntax, but better to not demotivate novices by saying it's hard straight away. Personally I think the major reason things are so difficult is that most courses don't teach the basic data structures and how to access them, instead focussing solely on statistics, so we've all had to struggle through it. I'm a big advocate of Hadley Wickham's Advanced R in that regard.

I haven't seen tidyr, I will have to check it out!

@dhaine I completely agree with you. It's not helpful for novices to compare to other languages (unless they come from a Comp Sci background), but we also shouldn't be advocating any one language as the be-all and end-all solution.

@gavinsimpson stringsAsFactors is a double-edged sword: it's good for making sure categorical data is handled correctly, but I've been bitten a number of times in the past where it's clobbered my id column (where row.names=1 hasn't automatically worked), and it has caused problems downstream when merging data frames or drawing conclusions about which variables are significant.
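
A minimal sketch of the id-column gotcha, using a tiny hypothetical CSV invented for illustration. The argument is passed explicitly because the default differs between R versions (it changed to FALSE in R 4.0.0, well after this thread):

```r
# Hypothetical data, invented for illustration
csv <- "id,group,value
s10,a,1.5
s2,b,2.5
s1,a,3.5"

# stringsAsFactors is set explicitly so the result does not depend on
# the default of the installed R version
with_factors <- read.csv(text = csv, stringsAsFactors = TRUE)
with_strings <- read.csv(text = csv, stringsAsFactors = FALSE)

class(with_factors$id)  # "factor"
class(with_strings$id)  # "character"

# Converting the factor to integer returns internal level codes,
# not anything derived from the id labels themselves
codes <- as.integer(with_factors$id)

# One workaround: convert back to character before merging or matching
ids <- as.character(with_factors$id)
stopifnot(identical(ids, with_strings$id))
```

Whether the factor default helps or hurts depends on the column: it suits `group` here, but not an identifier like `id`.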

On giving motivating examples, I believe @gvwilson's intention was for the pitches to be quite short, ~3 mins each. See pitch.html for the swcarpentry pitch; it's quite short, so I'm not sure how much room there is for a motivating example.

@gavinsimpson
Contributor

@sritchie73 as a pitch, you aren't going to need to explain what each line of code does, you just need to explain the general steps (KDE in line 1, bootstrap on lines 4-7, CI on line 8, rest plotting in the above example), point to the efficiency of the small amount of code needed to do this and point to the result. We don't even need to have just one example but perhaps a few to choose from or insert your own favorite.

The pitch needs to be more than "trust me on these things, language x is great because it will save you time / allow you to do x, y, & z once you've invested a bit of time learning". At least that's been my experience.

My point re stringsAsFactors is that your personal experience with this shouldn't colour the presentation. If you mention it at all, you need to indicate the utility of the default and point out the negative: that it might result in data being stored in factor rather than character format. To be honest, if you don't have time to include a great motivating example, why are we even discussing in the pitch a minor gotcha that is quick to work around? There are far more important negative sides to R, like relatively slow speed and inconsistent function and argument naming in the base language.

Re your reply to @dhaine I agree that knowing about a range of languages is helpful, right tool for the job and all, but that doesn't mean R isn't an easy or useful language for data manipulation/processing. This isn't a negative against R, it isn't bad at data processing.

@ramnathv

My 2 cents on the motivating example idea. I strongly concur with @gavinsimpson that to gain people's trust, it is best to walk the talk and show them how a few lines of code could do something for them, that would typically take many lines of code in other languages. Having a carefully curated list of examples will allow instructors to pick the one most relevant to their audience.

@jdblischak jdblischak mentioned this pull request Jul 26, 2014
@gvwilson
Contributor

@jdblischak @dhaine Please merge if you think this is close enough.

@jdblischak
Contributor

My understanding is that this content is being merged into #628.

@jdblischak jdblischak closed this Oct 1, 2014