This repository has been archived by the owner on Jan 3, 2018. It is now read-only.

Domain-specific bioinformatics example #532

Closed
rbeagrie opened this issue Jun 9, 2014 · 33 comments

Comments

@rbeagrie
Contributor

rbeagrie commented Jun 9, 2014

A couple of people have said over the past month or so that a domain specific example that involves handling next-generation sequencing data would be a very useful thing to have - so it seemed like a good project for the July sprints next month.

I'm starting with the assumption that this will be a 1-2 hour exercise to sit right at the end of a SWC bootcamp. Since the people who are interested seem to be quite diverse, I would really appreciate it if people could leave some comments as to what might be most useful to them. This would really help us focus down on the right direction to be working in come July.

Specifically, if you would be interested in teaching with a domain specific example, it would be great if you could let us know:

  1. Who would your target audience be: beginners, intermediates, or something else?
  2. Are there any tools you would like to be covered? (I'm thinking bedtools, SAMtools, HTSeq, DESeq, etc.)
  3. Should we aim more towards giving learners something they can do right now (e.g. making a bigWig file and uploading it to UCSC) or something that showcases the things they will be able to do with a bit more independent study (e.g. RNA-seq analysis)?
  4. How much should we try to tie in with the relevant set of language lessons, and which language? For example, should this be something that learners can follow having covered only the concepts in novice R, or intermediate Python, etc.?
  5. Are there any other parts of the core Software Carpentry curriculum we can try to reinforce (e.g. SQL)?

If you have any other comments/thoughts/opinions please let us know!

@ctb

ctb commented Jun 10, 2014

@rbeagrie, we do this for a living in the ANGUS course, albeit in a slightly different format (worked examples followed by free time for people to work with their own data) - see http://ged.msu.edu/angus/ for way too much info. Perhaps we can help by providing some fodder!

variant calling:

De novo mRNAseq and metagenome assembly:

More later.

@stephenturner
Contributor

I'm working on developing an RNA-seq workshop that can be completed in a day similar to what you described (alignment with tophat, count with featureCounts in linux shell; analysis with DESeq in R). This is very much a work in progress at this point, but you can see what I'm doing here:

https://github.com/stephenturner/teaching/tree/master/rna-seq

The basic idea: I downloaded some data from GEO, mapped everything, analyzed with DESeq, picked some interesting regions on chromosome 4, and then extracted FASTQ files from those regions from the bam files. This way participants will have very small fastq files to work with that should map quickly, and they can index a single chromosome for read mapping so as to reduce RAM requirements to the point that this would work on a VM running on an average laptop.

I'm happy to continue developing in this repo, and/or move this over to the SWC repo (with some help from someone with better git-fu than me, to help move material already under VC in a separate repo to the swcarpentry/bc repo while preserving the commit history).

@gvwilson
Contributor

@rbeagrie
Contributor Author

These are all great examples, and I think it will definitely help to have something to work from during the sprints, rather than just starting from a blank slate. I think they nicely illustrate the two ways I think this could go:

  1. IMO, the examples from @ctb and @stephenturner would work well at the end of an intermediate workshop, where people would hopefully have enough experience to deal with installing the extra software they would need. I'm not sure something like this would fit at the end of a novice bootcamp though. Getting a group of 40 novices to install a mapper, plus two or three other downstream tools, could be a bit of a nightmare, and I wouldn't want to end the two days with a straight demo that people couldn't follow along with themselves - that runs the risk of being a little demotivating.
  2. @gvwilson's example would sit really well at the end of a novice bootcamp as it relies on unix tools that people would mostly have already used. On the other hand, I don't think it would work that well for intermediates, where I would prefer to show them tools written for NGS data that they could actually take into their own analyses.

This is why I think it's super important to nail down who and what this example is going to be for. As far as I'm aware, most of our bootcamps are still aimed at novices so I would lean towards something more like option 2. On the other hand, if most of the people who would use a domain specific example are running intermediate workshops then something more like option 1 would make sense.

If we decide that this is going to be most useful sitting at the end of a novice bootcamp, I think it ought to try to tie in as much of the core curriculum as possible. Following the ANGUS example, I like the idea of having novices clone a repository with some analysis already done and adding some extra bits, as you can reinforce and tie together program design, unix shell and version control all at once... the instructor could even have them issue pull requests against the original repo and code review each other's work (which is great as people could submit/comment even after the bootcamp has finished if they run out of time).

@ctb

ctb commented Jun 10, 2014

Whoops, forgot the key point: you can't run any of the assembly stuff on most people's laptops. The variant calling could be done, but a virtual machine is probably the best way to go. In practice, I would strongly urge people to use a VM if they're doing anything NGS-y. @stephenturner, is this true for the reference-based RNAseq analysis software too?

@stephenturner
Contributor

Working to get RAM requirements under 2G by indexing only a single chromosome and mapping only reads to that chromosome. Should be possible. And yes, despite the limitations the only way I'd teach this in a bootcamp is distributing a VM with software pre-installed.

@ctb

ctb commented Jun 10, 2014

On Tue, Jun 10, 2014 at 12:44:47PM -0700, Stephen Turner wrote:

Working to get RAM requirements under 2G by indexing only a single chromosome and mapping only reads to that chromosome. Should be possible. And yes, despite the limitations the only way I'd teach this in a bootcamp is distributing a VM with software pre-installed.

OK, same strategy I use :)

@hdashnow

There are already some well developed NGS tutorials and the infrastructure to run them. For example we (vlsci.org.au) and others have developed this material https://genome.edu.au/wiki/Learn for our workshops. Andrew Lonie (http://vlsci.org.au/researcher/alonie) might be able to point you towards other resources that you could use or adapt.

@rbeagrie
Contributor Author

Hmm. I still feel quite strongly that a custom VM is not the best way to go in this specific instance, as it would massively cut down on the number of people that could potentially use this at the end of a novice bootcamp.

I propose a 1.5 hour example that could be done by a learner at the end of a novice bootcamp, involving investigating a FastQ file of unknown origin. I would break it up like this:

First half hour: Exploring FastQ files using commands covered in shell lectures (head, tail, wc etc) - based on the old v4 lesson Greg linked to

Second half hour: Parsing FastQ files using biopython, introducing quality strings etc, based on Will Trimble's biopython lesson from last year's Tufts bootcamp

Last half hour: BLASTing the first 50 or so reads from the FastQ file using BioPython's interface to the BLAST web service to find out what organism the data is from - inspired by Titus' zero entry BLAST stuff. Then a 10 minute wrap up with a brainstorming session on what problems learners might apply this sort of stuff to from their own research.
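To make the middle half hour concrete, here is a minimal sketch of parsing FASTQ records and decoding a quality string. It uses only the Python standard library rather than Biopython, and it assumes simple four-line records with Phred+33 quality encoding; the function names and the two example reads are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream.

    Assumes unwrapped records; stops at end of file or at the first blank line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"bad FASTQ header: {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality, offset=33):
    """Average Phred score of one read; 0.0 for an empty quality string."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


# Two invented reads, standing in for a real FASTQ file.
EXAMPLE_FASTQ = """@read1
GATTACA
+
IIIIIII
@read2
ACGT
+
!!5I
"""

records = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
for name, sequence, quality in records:
    print(name, len(sequence), mean_quality(quality))
```

With a real file, `io.StringIO(EXAMPLE_FASTQ)` would be replaced by `open("reads.fastq")`, where the file name is again only a placeholder.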

@ctb

ctb commented Jun 11, 2014

A few comments, and then I'll leave you alone --

  • BLAST is, scientifically speaking, the wrong tool to use to explore Illumina reads. The reads are too short to contain enough information for highly specific matches in many cases, and the default indel scoring is bad for BLAST. I think it's a bad idea to teach BLASTing short reads, because people would then expect to use it!
  • If people hear NGS they're going to want to do something like mapping (variant calling or mRNAseq) or assembly. Nobody should be looking at raw reads with Python - what are they going to do with 'em, anyway? Except for BLAST? :)
  • Realistically, almost no one is going to do much with NGS data on their own laptop. For one, both assembly and the SAM/BAM steps of mapping require lots of memory and/or disk space and/or time. Plus, most of the tools only work on UNIX, which few people have installed. So experience with either a cloud VM or a local VM is going to be true to their future needs. (A lot more detailed argument here: http://ivory.idyll.org/blog/bioinformatics-training-suites.html) This point is the main reason why I feel like bioinformatics, or at least the NGS-y part of it, is a poor fit with the traditional Software Carpentry approach.
  • I don't know how widely used BioPython is, but I'm -0 on teaching new people about it -- I don't know anyone who uses it in earnest, myself. The only positive suggestion I have is to look at screed, which makes sequence loading pretty straightforward and is purely native Python to boot, but it's a product of my lab, so I'm biased :). Anyway, here's some docs: http://screed.readthedocs.org/en/latest/screed.html#quick-start

@gvwilson
Contributor

@rbeagrie wrote:

Hmm. I still feel quite strongly that a custom VM is not the best way
to go in this specific instance, as it would massively cut down on the
number of people that could potentially use this at the end of a
novice bootcamp.

I've had poor results using VMs in the classroom: they won't run well on
older/slower machines, and people get lost in "wait, what's the keyboard
shortcut for pasting when I'm in this window?" On the other hand,
@ctb has had good luck getting people to run on cloud VMs - Titus, care
to weigh in?

@ctb

ctb commented Jun 11, 2014

Sure -- all my bootcamps either bring up Amazon VMs for people (as with zero-entry workshops) or I teach people how to bring up their own Amazon VM (in workshops that are longer than a few days). The argument, again, is that people will actually be analyzing their NGS data on remote machines, so taking the time to introduce them to logins & remote command line doesn't harm.

@stephenturner
Contributor

@ctb have you ever gotten Amazon to give vouchers or anything for AWS usage? Or do you get participants to enter their billing / CC info? How much does this end up costing for a few hours of compute on a small dataset for a 1-2 day workshop?

@rbeagrie
Contributor Author

@ctb please don't feel like I want you to leave me alone! I pretty much agree entirely with your points. Especially this: "bioinformatics, or at least the NGS-y part of it, is a poor fit with the traditional Software Carpentry approach".

My big picture thinking here is that we want something that would allow someone to do:

$ swc get NGS-capstone

And get one short lesson that can round off a novice bootcamp, and show learners how they can apply what they have learned to their own 'NGS-y' research. It's entirely possible that there is nothing you can teach in a couple of hours that is best practice, relies only on the core software we ask people to install as part of a novice bootcamp and that doesn't require a VM. If so, that's fine. I definitely agree that in 99% of cases, if you are iterating over raw sequencing reads in python (or any other language) you are probably "doing it wrong".

One possible compromise would be a set of "work through these yourself" challenges with samtools and bedtools. They are widely used tools, and if you use them correctly they allow you to accomplish a lot on your own laptop without much RAM. The (big) compromise here is that helpers and instructors will likely spend most of the lesson dealing with installation headaches - hence why it would have to be diy challenges. The upside is that everyone goes away with a versatile toolset that will actually help them get stuff done.

@ctb

ctb commented Jun 11, 2014

Amazon is usually happy to provide $100 vouchers per student. For a 1-2 day workshop things usually cost less than $5; for a semester long course, most students don't go over $100.

@ctb

ctb commented Jun 11, 2014

@rbeagrie ;). I like the idea of the Web BLAST; maybe use the whole-proteome-vs-whole-proteome bit (BLAST ecoli x salmonella; output CSV of matches) from the zero entry bootcamps? That would be an excellent motivator for biologists to understand how powerful this is.

@stephenturner
Contributor

Great discussion. One point @rbeagrie:

The (big) compromise here is that helpers and instructors will likely spend most of the lesson dealing with installation headaches

I'm developing some material for a workshop here that I'll eventually roll into this repo, but this is a compromise that I can't make when teaching without TAs/helpers. I can't see any way out of a desktop or cloud VM with at least a handful of tools pre-installed.

@jdblischak
Contributor

Thanks for organizing this, @rbeagrie. Here are my thoughts:

  1. Who would your target audience be, beginners, intermediates or something else?

Any bootcamp pitched at biologists, no matter how it is advertised, will likely attract many novices. I think it would be best to just prepare for this.

  2. Are there any tools you would like to be covered (I'm thinking bedtools, SAMtools, HTSeq, DESeq, etc.)

For the purpose of doing something interesting while reinforcing skills learned during a bootcamp, I think a strong focus on bedtools is the best option.

  4. How much should we try to tie in with the relevant set of language lessons, and which language? For example, should this be something that learners can follow having covered only the concepts in novice R, or intermediate python etc?

I think good prerequisites would be novice shell and then either novice R or novice Python. There should be many bootcamps that cover this material and thus be able to use this lesson at the end.

Bigger picture, what is the goal of this lesson? Seeing as it will only take place for a few hours at the end of an already information crammed SWC bootcamp, I don't think it is feasible for the goal to be 'Teach attendees to perform an RNA-seq analysis from fastq files to list of DE genes.' This is just too far out of scope and would have to be so rushed that it would not be covered in any more depth than if the attendees just read through the basic documentation themselves. I think a better goal would be 'Show students how the basic computing skills learned during the bootcamp can be used for routine bioinformatics tasks.' This could be accomplished by piping together some bioinformatics command line tools with unix utilities, and then reading the result into Python or R and creating a quick visualization.

@ctb

ctb commented Jun 12, 2014

I agree with everything that @jdblischak says with one very important exception: most of the novice biologists I interact with have neither grounding nor specific motivation for learning anything Software Carpentry, and all the feedback I've gotten suggests that starting with a traditional SWC topic set (shell, Python) is total fail for novice biologists. I've heard from others with similar experiences.

@rbeagrie
Contributor Author

@ctb hmm not sure I can agree with you 100% there. I know several novice biologists who've been to "zero entry" bioinformatics workshops that didn't cover the shell, and they were completely lost. I take your point that a completely "off the shelf" SWC bootcamp may not be the best approach with novice biologists. However, given that people are offering these types of workshops, I think it's worthwhile giving the best demonstration we can of how the skills people have learned can be applied to their research. Considering all the caveats we've been discussing, I tend to agree with @jdblischak that bedtools is probably the best option.

@stephenturner
Contributor

@jdblischak , @rbeagrie : I also like the idea of teaching bedtools, but are we at risk of thinking "what do we want to teach" instead of "what do students want to learn?" I'd wager that a good number of average "biologists" (however we're defining that here) would be much more interested in some kind of finished product data analysis that's relevant to their field of study - an assembled genome, a list of differentially regulated genes from an RNA-seq experiment, annotated variants, etc. Bedtools is an infinitely useful and indispensable tool in any bioinformatician's toolbox, but I'm not completely convinced wrapping up the day with teaching a biologist how to munge genomic intervals will have a motivating and lasting impact unless that biologist eventually gets more involved in a bioinformatics lab.

@rbeagrie
Contributor Author

So are we saying that we should be discouraging people from running SWC bootcamps aimed at novice biologists, in favour of a proper data analysis (RNA-seq or whatever) workshop?

@stephenturner
Contributor

I don't think they're mutually exclusive. A fair number of biologists came to the two SWC bootcamps we had here, and I believe they got a lot out of it. But perhaps a domain-specific bioinformatics exercise might be better if it resulted in the participant reaching some analytical endpoint - assembly, gene list, etc (easier said than done, admittedly).

@rbeagrie
Contributor Author

I guess my opinion is that they are mutually exclusive. I'm not convinced you can teach someone something meaningful about differential expression analysis in only a couple of hours. In fact I can't think of any analytical endpoint that can be fully explored in less than a day...

@wking
Contributor

wking commented Jun 12, 2014

On Thu, Jun 12, 2014 at 01:41:38PM -0700, Rob Beagrie wrote:

I'm not convinced you can teach someone something meaningful about
differential expression analysis in only a couple of hours.

Teaching the science behind the analysis (and when that particular
analysis makes sense) is probably out of scope for a two-day workshop
(and certainly is if you'll be discussing other things like a stock
SWC workshop). Just because folks won't be taking it back to their
lab unaltered doesn't mean that a short capstone example is a bad
idea.

@rbeagrie
Contributor Author

Well I certainly agree that RNA-seq would be a better motivator if we can teach something in 2 hours. @stephenturner how long would you normally set aside to teach the RNA-seq example you posted above?

@stephenturner
Contributor

I think we all agree that we can't go into any kind of detail on theory/motivation behind any biological data analysis. But I think we can wrap up with some practical example in 1-2 hrs. Couple ideas:

  1. If the bootcamp covers R instead of Python, we could start with a count matrix and run it through DESeq. There are only a handful of commands, since DESeq2 now wraps the entire pipeline in a single DESeq function.
  2. If the bootcamp didn't cover R but covered python, we could do something with BEDTools like someone mentioned earlier. One idea: given bed file of some interesting regions (ChIP peaks, dysregulated genes, etc), and another set of regions, say, some ENCODE regions of interest, you want to ask is there significant over-representation of ENCODE features among your "interesting" features (this is actually a common question, not just some toy example). You can teach a little bedtools intersecting, discuss permutation theory, then set up a python program to do a few thousand bedtools shuffles, keep a log of your bedtools intersect results, and end up calculating a permutation p-value.
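The permutation idea in the second option can be sketched in plain Python. This is a toy stand-in, not a wrapper around bedtools: `shuffle_intervals` and `count_overlaps` only mimic the spirit of `bedtools shuffle` and `bedtools intersect` on a single made-up chromosome, and all coordinates, names, and the chromosome length are invented for illustration.

```python
import random


def count_overlaps(query, targets):
    """Count query intervals that overlap at least one target.

    Intervals are half-open (start, end) pairs, as in BED files.
    """
    return sum(
        any(q_start < t_end and t_start < q_end for t_start, t_end in targets)
        for q_start, q_end in query
    )


def shuffle_intervals(intervals, chrom_length, rng):
    """Move each interval to a random position, keeping its length."""
    shuffled = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, chrom_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def permutation_p_value(query, targets, chrom_length, n_permutations=1000, seed=0):
    """Return (observed, permuted counts, one-sided empirical p-value)."""
    rng = random.Random(seed)
    observed = count_overlaps(query, targets)
    permuted = [
        count_overlaps(shuffle_intervals(query, chrom_length, rng), targets)
        for _ in range(n_permutations)
    ]
    # Add one to numerator and denominator so the p-value is never exactly 0.
    as_extreme = sum(1 for count in permuted if count >= observed)
    return observed, permuted, (as_extreme + 1) / (n_permutations + 1)


# Invented toy data: three "ENCODE-like" regions and four "interesting" regions.
CHROM_LENGTH = 100_000
TARGETS = [(1_000, 2_000), (5_000, 6_000), (9_000, 10_000)]
QUERY = [(1_500, 1_600), (5_500, 5_600), (9_500, 9_600), (50_000, 50_100)]

observed, permuted, p_value = permutation_p_value(QUERY, TARGETS, CHROM_LENGTH)
print(f"observed overlaps: {observed}, permutation p-value: {p_value:.4f}")
```

In a lesson, the shuffling and intersecting would presumably be done by bedtools itself, with Python only collecting the counts and computing the final fraction.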

@ctb

ctb commented Jun 13, 2014

The bootcamp could also end with some plotting -- MA plots or other common
differential expression plots. See bottom of

https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/8-differential-expression.html

for one of our examples.
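For readers unfamiliar with MA plots, the two axes are simple to compute: M is the log2 fold change between two conditions and A is the mean log2 expression. The sketch below shows only that arithmetic on invented counts; it assumes a pseudocount of 1 to avoid taking the log of zero, and it skips the normalization and statistical modelling that a real differential expression tool performs.

```python
import math


def ma_values(counts_a, counts_b, pseudocount=1.0):
    """Return a list of (A, M) pairs, one per gene.

    A = mean of the two log2 counts, M = log2(b) - log2(a).
    The pseudocount is an assumption made here to handle zero counts.
    """
    points = []
    for a, b in zip(counts_a, counts_b):
        log_a = math.log2(a + pseudocount)
        log_b = math.log2(b + pseudocount)
        points.append(((log_a + log_b) / 2, log_b - log_a))
    return points


# Invented counts for three genes in two conditions.
control = [7, 15, 3]
treated = [31, 15, 0]

for gene, (a_value, m_value) in enumerate(ma_values(control, treated), start=1):
    print(f"gene{gene}: A={a_value:.2f} M={m_value:+.2f}")
```

Plotting M against A for every gene gives the familiar cloud centred on M = 0, with differentially expressed genes sitting away from the horizontal axis.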

@stephenturner
Contributor

Agreed. With R/DESeq, plotting is easy using the functions that come with the
DESeq2 package (MA plots, volcano plots, etc). The vignette also gives code to
produce others. I imagine something similar could be done with the hypothetical
BEDTools example I gave earlier - something like plotting a histogram of
permutation intersect results with the actual result way out on the tail.


@rbeagrie
Contributor Author

OK great, I'm very happy with this plan! We can use @stephenturner's RNA-seq example as a starting point on the R side. I'll have a look for any bedtools examples we can build on from the python side, unless anyone can suggest any?

@stephenturner
Contributor

Note: those materials are very much a work in progress. Hoping to make some progress on that front in the next couple of weeks.

@rbeagrie
Contributor Author

I've had a look around the net for a bedtools tutorial that we can build on, and I like this one from Aaron Quinlan's CSHL course: https://github.com/arq5x/tutorials/blob/master/bedtools.md

I would propose keeping everything up to "Counting the number of overlapping features.", then adding a section explaining bedtools shuffle. We can have learners write or correct a bash script that shuffles 1000 times, then a python script to read the results and plot the distribution of permuted overlaps compared to the real overlap.
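As a rough idea of what the Python half could look like, the sketch below reads one overlap count per line and draws a text histogram with the real overlap marked. The input format (one integer per line, as a shuffle loop might write it) and the numbers are assumptions made for illustration, and the text bars stand in for a proper plot from a plotting library.

```python
from collections import Counter
import io


def read_counts(handle):
    """Read one integer overlap count per non-empty line."""
    return [int(line) for line in handle if line.strip()]


def text_histogram(counts, observed, width=40):
    """Render counts as rows of '#' bars and mark the observed value."""
    tally = Counter(counts)
    low = min(min(counts), observed)
    high = max(max(counts), observed)
    peak = max(tally.values())
    lines = []
    for value in range(low, high + 1):
        bar = "#" * round(width * tally[value] / peak)
        marker = " <- observed" if value == observed else ""
        lines.append(f"{value:>4} | {bar}{marker}")
    return "\n".join(lines)


# Invented results standing in for the output file of the shuffle script.
PERMUTED = io.StringIO("2\n3\n3\n4\n3\n2\n5\n3\n4\n3\n")
counts = read_counts(PERMUTED)
print(text_histogram(counts, observed=8))
```

The observed count sitting well to the right of every permuted count is the picture learners would be aiming for when the overlap is larger than expected by chance.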

@gvwilson
Contributor

gvwilson commented Feb 3, 2016

Data Carpentry is doing this.

@gvwilson gvwilson closed this as completed Feb 3, 2016
7 participants