# Homework 1: File formats and TF binding sites

By the end of this homework you will have...

* Learned what the common genome annotation formats are (BED, GTF)
* Used your knowledge of UNIX commands to ask biological questions
* Used the UCSC Genome Browser to download BED files
* Loaded `biotools` on TSCC to get access to the `bedtools` program
* Used `bedtools` to get promoters of genes and overlap them with transcription factor binding sites
* Copied data from TSCC to your laptop
* Become more comfortable with reading and navigating documentation and man pages

## Setup

This assumes you have already set up [TSCC](2_tscc_setup.ipynb) and [Jupyter notebooks on TSCC](4_juypter_setup.ipynb), and cloned the `biom262-2016` GitHub repo to your `~/code` directory.

1. Log in to TSCC
2. Start a Jupyter notebook server (`jupyter notebook --port #### --no-browser &`)
3. Navigate to `~/code/biom262-2016`
4. Create a branch
5. Tunnel the server back to your laptop (`ssh -NL ####:localhost:#### username@tscc-login2.sdsc.edu &`)
6. Go to `http://localhost:####` in your browser
7. Navigate to `~/code/biom262-2016/weeks/week01` in the Jupyter Notebook browser.

### If this notebook on the website looks different than what you have

You'll need to get the most updated one from `biom262/biom262-2016`, which we set up as the "`upstream`" repo [previously](http://nbviewer.ipython.org/github/biom262/biom262-2016/blob/master/weeks/week01/5_fork_clone_biom262-2016_repo.ipynb). To do that, do:

```
git pull upstream master
```

* * *

Now that you're running this notebook, **run the cell below using Shift+Enter** to change directories to `~/code/biom262-2016/weeks/week01/data` to be in the same directory as the data.


In [13]:
cd ~/code/biom262-2016/weeks/week01/data



## What is an annotation?
An annotation is a label that assigns a particular characteristic to a base pair or a stretch of base pairs in a DNA sequence. The most common type of annotation assigns chromosomal coordinates to a gene ID (and its associated metadata). Genetic sequence is most commonly stored as strings of 'A', 'C', 'G', and 'T' (FASTA format). Annotations are sometimes bundled with the sequence (a GenBank file, `.gbk`) or downloaded separately in memory-efficient formats like GTF and BED.

### BED Format: stuff in the genome

BED stands for "Browser Extensible Data" (informative I know .....) and is a standard format in bioinformatics for describing locations of stuff in the genome. The basic format is described [here](https://genome.ucsc.edu/FAQ/FAQformat.html#format1), and here's a minimal example (from [`pybedtools`](https://github.com/daler/pybedtools/blob/master/pybedtools/test/data/a.bed) tests):

```
chr1	0	100	feature1	0	+
chr1	100	200	feature2	0	+
chr1	150	500	feature3	0	-
chr1	900	950	feature4	0	+
```

As you can see, the format is:

1. Chromosome name
2. Start of the feature (0-based, so the first base of the genome is "0")
3. Stop of the feature (0-based, and not inclusive). This is for computer science reasons: in most programming languages the "0th" item is the first thing, so the item at index 100 is actually the 101st element. With a 0-based start and a non-inclusive stop, `feature1` above covers bases 0 through 99, and its length is simply stop minus start: 100 - 0 = 100. This convention helps avoid the "off by one error" (a common problem in bioinformatics).
4. Name of the feature
5. Score of the feature, which is some integer from 0 to 1000. Some BED files come with a dot/period "`.`" here instead if it doesn't make sense for them to have a score, but there are programs that will complain that your BED file is improperly formatted, so I end up using `awk` or something to fill that column with 1000 for every row.
6. Strand of the feature. If a strand doesn't make sense (e.g. for DNA methylation) then a dot/period "`.`"  is here.
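
To make the six columns concrete, here is a minimal Python sketch that splits one BED line into named fields and computes the feature length. The names `BedFeature`, `parse_bed6`, and `feature_length` are made up for this illustration and are not part of `bedtools` or `pybedtools`.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, not inclusive
    name: str
    score: str   # kept as text because some files use "." here
    strand: str


def parse_bed6(line: str) -> BedFeature:
    """Split one tab-separated BED line and keep the first six columns."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return BedFeature(chrom, int(start), int(end), name, score, strand)


def feature_length(feature: BedFeature) -> int:
    """Because the stop is not inclusive, the length is just stop - start."""
    return feature.end - feature.start


feature1 = parse_bed6("chr1\t0\t100\tfeature1\t0\t+")
print(feature1.name, feature_length(feature1))  # feature1 100
```

Keeping the score as text is a deliberate choice in this sketch, so that files with "`.`" in that column still parse.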

### Get Transcription Factor (TF) Binding Site BED files from the UCSC Table Browser

Go to the [UCSC Genome Browser](http://genome.ucsc.edu/) and click "Tools > Table Browser", as shown below.

![](images/table_browser.png "UCSC Table Browser")

1. Go to the [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables)
2. Choose "clade: Mammal," "genome: Human," and "assembly: Feb. 2009 (GRCh37/hg19)"
3. Choose "group: Regulation", "track: Txn Fac ChIP V2"
4. Use the default table (there are no others anyway)
5. Put "chr22" for the Region
6. Choose "output format: BED - browser extensible data".
7. Save as "`tf.bed`"
8. Click "get output".
9. Do one bed record per gene (They really mean per "item", not a whole gene)
10. Click "get BED"

What does this file look like? Is it similar to what we had before? Let's use `head` to look at the first few lines, which by default is 10 lines.


Hint: Remember that the dollar sign ("`$`") indicates the shell command prompt that you're in, and shouldn't be included when you're copying commands.

```
$ head ~/Downloads/tf.bed     
chr22	16166497	16166741	CTCF	186	.	16166497	16166741	0	1	244,	0,
chr22	16201947	16202317	CTCF	603	.	16201947	16202317	0	1	370,	0,
chr22	16201988	16202252	YY1_(C-20)	63	.	16201988	16202252	0	1	264,	0,
chr22	16202021	16202231	Rad21	92	.	16202021	16202231	0	1	210,	0,
chr22	16202128	16202242	E2F6_(H-50)	269	.	16202128	16202242	0	1	114,	0,
chr22	16205233	16205683	SETDB1	356	.	16205233	16205683	0	1	450,	0,
chr22	16325696	16325900	NRSF	36	.	16325696	16325900	0	1	204,	0,
chr22	16560670	16560910	MafK_(ab50322)	102	.	16560670	16560910	0	1	240,	0,
chr22	16872258	16872628	CTCF	279	.	16872258	16872628	0	1	370,	0,
chr22	16872319	16872563	SMC3_(ab9263)	309	.	16872319	16872563	0	1	244,	0
```
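
The fourth column holds the transcription factor name. As a rough Python sketch (the function name `count_tf_names` is hypothetical, and the sample below is just three records copied from the output above, trimmed to six columns), here is one way to tally how many records each name has:

```python
from collections import Counter


def count_tf_names(bed_lines):
    """Count how many records carry each name (column 4)."""
    counts = Counter()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:  # skip blank or malformed lines
            counts[fields[3]] += 1
    return counts


# Three records from the `head` output, first six columns only.
sample = [
    "chr22\t16166497\t16166741\tCTCF\t186\t.",
    "chr22\t16201947\t16202317\tCTCF\t603\t.",
    "chr22\t16202021\t16202231\tRad21\t92\t.",
]
print(count_tf_names(sample).most_common())  # [('CTCF', 2), ('Rad21', 1)]
```

Since the function accepts any iterable of lines, an open file handle for `tf.bed` could be passed in the same way.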

### Copy `tf.bed` to TSCC

#### Mac/Linux

To copy data between computers, you can use "secure copy", or "`scp`". It works just like `cp filename newplace`, except that you also have to specify the server name and where on the server you want to put the file.

*Note: you are running this command from your **laptop**, because you need to be able to specify a unique address/URL for one of the file locations. Unless you have command line login and a static IP for your laptop (which takes a lot of work and is therefore unlikely), then you cannot run `scp` from TSCC to copy files to/from your laptop. You must always run `scp` from your laptop if you want to move files to/from a server.*

```
$ scp tf.bed username@tscc.sdsc.edu:~/code/biom262-2016/weeks/week01/data
```

Written more generically, the `scp` command looks like

```
$ scp localfile username@server:/path/to/new/place/on/server
```

#### Windows

(Instructions compiled from [Cornell IT](http://www.it.cornell.edu/services/managed_servers/howto/file_transfer/fileputty.cfm) and [Analyzing Next-Gen Seq (ANGUS) data workshop](http://ged.msu.edu/angus/tutorials/using-putty-on-windows.html))

Download [PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html). It doesn't need to be installed, so you can put the files on your desktop or wherever you like to put your programs. You'll need both the `putty.exe` and `pscp.exe` files.

Open the windows command line 

![](http://www.it.cornell.edu/catc/cms/services/managed_servers/howto/file_transfer/images/filewindowsstart_1.jpg)
![](http://www.it.cornell.edu/catc/cms/services/managed_servers/howto/file_transfer/images/filewindowscmd_1.jpg)

At the command prompt, you can now enter the same command as for Mac/Linux, but using `pscp` instead of `scp`.

```
c:\>pscp tf.bed username@tscc.sdsc.edu:~/code/biom262-2016/weeks/week01/data
```

Written more generically, the `pscp` command looks like

```
c:\>pscp localfile username@server:/path/to/new/place/on/server
```

### Exercise 1: Get NFKB binding sites only

Use your UNIX skillz to filter the `tf.bed` file for only the NFKB ([Nuclear factor kappa B](http://en.wikipedia.org/wiki/NF-%CE%BAB)) TFs. Save this file as `tf.nfkb.bed`. This file should have 702 lines.

In [14]:
%%bash --out exercise1
### BEGIN SOLUTION
cat tf.bed | grep NFKB  > tf.nfkb.bed
### END SOLUTION

wc -l tf.nfkb.bed
echo '--- First 10 lines ---'
head tf.nfkb.bed
echo '--- Random 10 lines ---'
awk -v seed=907 'BEGIN{srand(seed);}{ if (rand() < 0.5 ) {print $0}}'  tf.nfkb.bed | head
echo '--- Last 10 lines ---'
tail tf.nfkb.bed

The following cell is for your debugging purposes. Due to the way we have to set up the notebook to grade it, we add `--out exercise1` next to the `%%bash` magic to save the output of the previous cell to the variable `exercise1`, so here we're printing the output. This is what *would* be shown below the cell if we weren't doing this workaround. This is what is graded.

In [15]:
print(exercise1)

     702 tf.nfkb.bed
--- First 10 lines ---
chr22	17565763	17566145	NFKB	406	.	17565763	17566145	0	1	382,	0,
chr22	17568509	17568945	NFKB	354	.	17568509	17568945	0	1	436,	0,
chr22	17700141	17700502	NFKB	525	.	17700141	17700502	0	1	361,	0,
chr22	17701962	17702278	NFKB	295	.	17701962	17702278	0	1	316,	0,
chr22	17709521	17709797	NFKB	300	.	17709521	17709797	0	1	276,	0,
chr22	17724391	17724724	NFKB	413	.	17724391	17724724	0	1	333,	0,
chr22	17738401	17738677	NFKB	462	.	17738401	17738677	0	1	276,	0,
chr22	17738857	17739282	NFKB	573	.	17738857	17739282	0	1	425,	0,
chr22	17761088	17761435	NFKB	244	.	17761088	17761435	0	1	347,	0,
chr22	17784788	17785118	NFKB	1000	.	17784788	17785118	0	1	330,	0,
--- Random 10 lines ---
chr22	17700141	17700502	NFKB	525	.	17700141	17700502	0	1	361,	0,
chr22	17701962	17702278	NFKB	295	.	17701962	17702278	0	1	316,	0,
chr22	17709521	17709797	NFKB	300	.	17709521	17709797	0	1	276,	0,
chr22	17738401	17738677	NFKB	462	.	17738401	17738677	0	1	276,	0,
chr22	17738857	177392

In [16]:
answer1 = '''702 tf.nfkb.bed
--- First 10 lines ---
chr22	17565763	17566145	NFKB	406	.	17565763	17566145	0	1	382,	0,
chr22	17568509	17568945	NFKB	354	.	17568509	17568945	0	1	436,	0,
chr22	17700141	17700502	NFKB	525	.	17700141	17700502	0	1	361,	0,
chr22	17701962	17702278	NFKB	295	.	17701962	17702278	0	1	316,	0,
chr22	17709521	17709797	NFKB	300	.	17709521	17709797	0	1	276,	0,
chr22	17724391	17724724	NFKB	413	.	17724391	17724724	0	1	333,	0,
chr22	17738401	17738677	NFKB	462	.	17738401	17738677	0	1	276,	0,
chr22	17738857	17739282	NFKB	573	.	17738857	17739282	0	1	425,	0,
chr22	17761088	17761435	NFKB	244	.	17761088	17761435	0	1	347,	0,
chr22	17784788	17785118	NFKB	1000	.	17784788	17785118	0	1	330,	0,
--- Random 10 lines ---
chr22	17700141	17700502	NFKB	525	.	17700141	17700502	0	1	361,	0,
chr22	17701962	17702278	NFKB	295	.	17701962	17702278	0	1	316,	0,
chr22	17709521	17709797	NFKB	300	.	17709521	17709797	0	1	276,	0,
chr22	17738401	17738677	NFKB	462	.	17738401	17738677	0	1	276,	0,
chr22	17738857	17739282	NFKB	573	.	17738857	17739282	0	1	425,	0,
chr22	17761088	17761435	NFKB	244	.	17761088	17761435	0	1	347,	0,
chr22	17784788	17785118	NFKB	1000	.	17784788	17785118	0	1	330,	0,
chr22	18111762	18112038	NFKB	453	.	18111762	18112038	0	1	276,	0,
chr22	18225198	18225524	NFKB	428	.	18225198	18225524	0	1	326,	0,
chr22	18278491	18278848	NFKB	1000	.	18278491	18278848	0	1	357,	0,
--- Last 10 lines ---
chr22	50969418	50969694	NFKB	415	.	50969418	50969694	0	1	276,	0,
chr22	50970799	50971099	NFKB	298	.	50970799	50971099	0	1	300,	0,
chr22	50978032	50978392	NFKB	1000	.	50978032	50978392	0	1	360,	0,
chr22	50979332	50979632	NFKB	307	.	50979332	50979632	0	1	300,	0,
chr22	50980831	50981158	NFKB	365	.	50980831	50981158	0	1	327,	0,
chr22	51001317	51001593	NFKB	380	.	51001317	51001593	0	1	276,	0,
chr22	51021040	51021316	NFKB	484	.	51021040	51021316	0	1	276,	0,
chr22	51021478	51021754	NFKB	363	.	51021478	51021754	0	1	276,	0,
chr22	51058900	51059230	NFKB	498	.	51058900	51059230	0	1	330,	0,
chr22	51060107	51060383	NFKB	354	.	51060107	51060383	0	1	276,	0,'''

# Remove whitespace at the beginning and end of the exercise 1 result
exercise1 = exercise1.strip()
assert exercise1 == answer1

### Gencode gene annotations

We are interested in which genes' promoters these transcription factors bind to. We will use the [GENCODE](http://www.gencodegenes.org/) gene annotation, which aggregates several gene annotation groups (ENSEMBL, HAVANA) to create a comprehensive set of human, and recently, mouse, gene annotations. We downloaded the annotations from [GENCODE v19](http://www.gencodegenes.org/releases/19.html), the last release using hg19, which as of writing in January 2016 has all the ENCODE transcription factor binding sites.

If you're interested in the difference between the GENCODE gene annotations and RefSeq genes, check out [this](http://www.biomedcentral.com/1471-2164/16/S8/S2) paper.

For this assignment, we have provided the gene annotations and sequence for human chromosome 22 (one of the smallest chromosomes), which are located in `biom262-2016/weeks/week01/data`.

#### In case you are curious ...
This section just shows how we grabbed chromosome 22 from each of the file formats. It is not graded material.

To get the `chr22` data from the GTF file, we can `grep` for lines that start with `chr22`, where the caret symbol "`^`" indicates the start of the line.

```
$ gunzip gencode.v19.annotation.gtf.gz
$ grep '^chr22' gencode.v19.annotation.gtf > gencode.v19.annotation.chr22.gtf
```

To get the fasta file, we first had to index it using `samtools faidx`, and then we could use the same program to grab chromosome 22.

```
$ gunzip GRCh37.p13.genome.fa.gz
$ samtools faidx GRCh37.p13.genome.fa
$ samtools faidx GRCh37.p13.genome.fa chr22 > GRCh37.p13.chr22.fa
```

#### Fasta and GTF formats

##### Fasta

The FASTA format is one of the oldest in bioinformatics and the simplest. The concept is this:

```
>sequence1_name 
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGGGGGGGGGG
GGGGGGGGGG
>sequence2_name
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCC
CCCCCCCCCC
```

A sequence entry starts with a greater-than sign "`>`" followed by the sequence name, and any number of sequence lines can follow. The end of the sequence is indicated by the start of the next one (or the end of the file!). The example above shows two 110-base-long sequences. In each, the first two rows are exactly the same length (50bp), followed by a final shorter row (10bp). Some programs require that the lines be the same length, so if you're creating FASTA files by hand, it's recommended to use [BioPython's SeqIO](http://biopython.org/wiki/SeqIO) to generate them so they're correctly formatted.
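
To show how little there is to the format, here is a minimal pure-Python reader for it. This is only a sketch for understanding the layout (the name `read_fasta` is made up here); for real work, BioPython's `SeqIO` is the safer choice. The toy input mirrors the two-sequence example above.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    # The end of the input closes the last record.
    if name is not None:
        yield name, "".join(chunks)


example = [
    ">sequence1_name",
    "G" * 40 + "C" * 10,
    "C" * 40 + "G" * 10,
    "G" * 10,
    ">sequence2_name",
    "C" * 40 + "G" * 10,
    "G" * 40 + "C" * 10,
    "C" * 10,
]
for name, sequence in read_fasta(example):
    print(name, len(sequence))
```

Both records come out 110 bases long, because the line breaks inside a record are not part of the sequence.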


We can use `head` to look at the first few lines (10 by default) of each of the files. The fasta file is kind of boring for the first few lines:


```
$ head GRCh37.p13.chr22.fa
>chr22
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
```

One question is, how many bases are in this file? Well, we'll need to ignore all rows of the file that start with "`>`", which we can do using `grep -v`, where "`-v`" *in**v**erts* the match and returns the non-matching lines. Here's the excerpt from `man grep`:

     -v, --invert-match
             Selected lines are those not matching any of the specified pat-
             terns.

To do this, we'll use `grep` to search for all lines that start with the character "`>`" (the "`^`" means "start of the line"). Remember that the greater-than sign is a special character in UNIX that redirects output to a file with the name that follows, so we need to put quotes around the "regular expression" (aka search term) to protect it from being interpreted by the shell.

```
$ grep -v '^>' GRCh37.p13.chr22.fa | head
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
```

Let's make sure we've only removed the first line by looking at the line count.

```
$ wc -l GRCh37.p13.chr22.fa
  855078 GRCh37.p13.chr22.fa
```

And this command:

```
$ grep -v '^>' GRCh37.p13.chr22.fa | wc -l
  855077
```

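So yes, we've removed exactly one line. Now we're ready to count the number of bases. Remember that the "`-l`" flag on "`wc`" tells it to count only the number of lines. Without any flags, `wc` prints the number of lines, words, and characters, in that order.

```
$ grep -v '^>' GRCh37.p13.chr22.fa | wc   
  855077  855077 52159643
```

The last number is the character count, but it includes the newline character at the end of each of the 855,077 lines. Subtracting those gives 52,159,643 - 855,077 = 51,304,566 bases in chromosome 22 (including the `N`s). Cool!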
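
The same count can be sketched in Python. One detail that matters: each line ends with a newline character, which is not a base, so the sketch below strips line endings before counting. The function name `count_bases` and the toy input are made up for illustration.

```python
def count_bases(lines):
    """Count sequence characters, skipping header lines and line endings."""
    total = 0
    for line in lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        total += len(line.rstrip("\r\n"))
    return total


toy = [">toy\n", "NNNNNNNNNN\n", "ACGTACGTAC\n", "ACGT\n"]

# What a plain character count of the non-header lines would report:
chars_with_newlines = sum(len(line) for line in toy if not line.startswith(">"))
print(chars_with_newlines)  # 27: 24 bases plus 3 newline characters
print(count_bases(toy))     # 24
```

To try it on the real file, an open handle for `GRCh37.p13.chr22.fa` can be passed in place of `toy`.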

##### GTF

Like the BED format, GTF encodes "stuff in the genome," but allows for hierarchy in the relationships between items and for additional annotations of the items. For example, GTFs can encode genes, which have transcripts as children, which have exons as children, which have coding sequences (CDS), untranslated regions (UTRs), stop codons, and more as children. This is not possible in BED files.

Read the GENCODE description of their [GTF](http://www.gencodegenes.org/data_format.html) format, which describes their columns and convenient annotations like whether the gene is protein-coding, the reading frame of coding sequences (CDS), and at which "level" a gene is annotated: verified, manually annotated, or automatically annotated.

Here is a screenshot of the GTF file from the GENCODE website.

![](images/gtf_example.png "GTF Example")

As you can see, it is *similar* to but *different* from the BED file. Besides the columns, a key difference is that GTF files are 1-based but BED files are 0-based. Thus, if we have a feature which in BED-land looks like:

```
chr1	0	100	feature1	0	+
```

Then in the GTF, we have:

```
chr1	source	gene	1	100	0	+	.	gene_id "feature1";
```

The GTF file is slightly easier to read because the feature starts at 1 (GTF is 1-based, so "1" is truly the first nucleotide) and truly ends on base 100 (the end is inclusive), so there is no more feature on base 101. Both lines describe the same 100 bases. How is this different for the negative strand?

For a BED file with the following 0-based coordinates, what would be the coordinates in the 1-based GTF file?

```
chr1	150	500	feature3	0	-
```

The rule is the same: only the start changes. Why is that? Coordinates are always counted from the beginning of the chromosome on the forward strand, no matter which strand the feature is on, so we add 1 to the start and leave the end alone.

```
chr1	source	gene	151	500	0	-	.	gene_id "feature3";
```
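
The coordinate conversion between the two formats can be written down as a pair of tiny Python functions. This is a sketch with made-up names (`bed_to_gtf_coords`, `gtf_to_bed_coords`), not something provided by `bedtools`.

```python
def bed_to_gtf_coords(bed_start, bed_end):
    """0-based, end-exclusive (BED) -> 1-based, end-inclusive (GTF)."""
    return bed_start + 1, bed_end


def gtf_to_bed_coords(gtf_start, gtf_end):
    """1-based, end-inclusive (GTF) -> 0-based, end-exclusive (BED)."""
    return gtf_start - 1, gtf_end


print(bed_to_gtf_coords(0, 100))    # (1, 100)
print(bed_to_gtf_coords(150, 500))  # (151, 500)
```

Note that the strand never appears as an argument: the shift applies to the start coordinate on either strand.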


Now let's look at GTF files on the command line. What if we do `head` on just the first two lines?

```
$ head -n 2 gencode.v19.annotation.chr22.gtf
chr22	HAVANA	gene	16062157	16063236	.	+	.	gene_id "ENSG00000233866.1"; transcript_id "ENSG00000233866.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "LA16c-4G1.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.3"; level 2; havana_gene "OTTHUMG00000140195.1";
chr22	HAVANA	transcript	16062157	16063236	.	+	.	gene_id "ENSG00000233866.1"; transcript_id "ENST00000424770.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "LA16c-4G1.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.3-001"; level 2; tag "basic"; havana_gene "OTTHUMG00000140195.1"; havana_transcript "OTTHUMT00000276574.1";
```


Oh that's annoying - GTF files are very wwwwwwwwwwwwwwiiiiiiiiiiiiiidddddddddddddddddddddeeeeeeeeeeeeeeeeeeeeeeeee, so the lines wrap around, which makes them difficult to view in the terminal and makes it hard to line up columns. So we don't want to use `head`. Instead we'll use `less` with the `-S` flag. Check out `man less` to see **all** the options (there's a ton). Here's an excerpt from `man less` to show the function of `-S`:

       -S or --chop-long-lines
              Causes  lines  longer than the screen width to be chopped rather
              than folded.  That is, the portion of a long line that does  not
              fit  in  the  screen width is not shown.  The default is to fold
              long lines; that is, display the remainder on the next line.


```
$ less -S gencode.v19.annotation.chr22.gtf
```

It's nice because you can use the left and right arrow keys to navigate to different columns.

Here's a screenshot of the output:

![](images/gencode_v19_chr22_less-S.png "Example with 'less -S' to chop instead of wrap lines")
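
Most of that width comes from the ninth column, which packs many `key "value";` pairs into one string. Here is a rough Python sketch of pulling a GTF line apart. The names `parse_gtf_line` and `ATTRIBUTE` are invented for this example, the regular expression is an assumption based only on the lines shown above (it is not an official GTF parser), and the sample line is a shortened version of the first `gene` record.

```python
import re
from collections import defaultdict

# A key, then either a quoted value or a bare token (e.g. `level 2`), then ";".
ATTRIBUTE = re.compile(r'(\w+) (?:"([^"]*)"|([^";\s]+));')


def parse_gtf_line(line):
    """Split a 9-column GTF line; raises ValueError if there are not 9 columns."""
    fields = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    parsed = defaultdict(list)
    # Keys such as `tag` can repeat, so each key maps to a list of values.
    for key, quoted, bare in ATTRIBUTE.findall(attributes):
        parsed[key].append(quoted or bare)
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(parsed),
    }


line = ('chr22\tHAVANA\tgene\t16062157\t16063236\t.\t+\t.\t'
        'gene_id "ENSG00000233866.1"; gene_type "lincRNA"; '
        'gene_name "LA16c-4G1.3"; level 2;')
record = parse_gtf_line(line)
print(record["feature"], record["start"], record["end"])
print(record["attributes"]["gene_name"], record["attributes"]["level"])
```

Storing each attribute as a list is a design choice for this sketch, so that repeated keys are not silently overwritten.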

### Exercise 2: Extract only the transcripts from the GTF file

Use what you've learned about UNIX so far to get only the rows of `gencode.v19.annotation.chr22.gtf` that contain features of type "`transcript`", saving to a file called `gencode.v19.annotation.chr22.transcript.gtf`. Your file should contain `4459` lines.

In [17]:
%%bash --out exercise2
### BEGIN SOLUTION
awk -F '\t' '{ if ($3 == "transcript") print $0; }' gencode.v19.annotation.chr22.gtf > gencode.v19.annotation.chr22.transcript.gtf
### END SOLUTION

wc -l gencode.v19.annotation.chr22.transcript.gtf
echo '--- First 10 lines ---'
head gencode.v19.annotation.chr22.transcript.gtf
echo '--- Random 10 lines ---'
awk -v seed=907 'BEGIN{srand(seed);}{ if (rand() < 0.5 ) {print $0}}'  gencode.v19.annotation.chr22.transcript.gtf | head
echo '--- Last 10 lines ---'
tail gencode.v19.annotation.chr22.transcript.gtf

The following cell is for your debugging purposes. Due to the way we have to set up the notebook to grade it, we add `--out exercise2` next to the `%%bash` magic to save the output of the previous cell to the variable `exercise2`, so here we're printing the output. This is what *would* be shown below the cell if we weren't doing this workaround. This is what is graded.

In [18]:
print(exercise2)

    4459 gencode.v19.annotation.chr22.transcript.gtf
--- First 10 lines ---
chr22	HAVANA	transcript	16062157	16063236	.	+	.	gene_id "ENSG00000233866.1"; transcript_id "ENST00000424770.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "LA16c-4G1.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.3-001"; level 2; tag "basic"; havana_gene "OTTHUMG00000140195.1"; havana_transcript "OTTHUMT00000276574.1";
chr22	HAVANA	transcript	16076052	16076172	.	-	.	gene_id "ENSG00000229286.1"; transcript_id "ENST00000448070.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-4G1.4"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.4-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140193.1"; havana_transcript "OTTHUMT00000276571.1";
chr22	HAVANA	transcript	16084249	16084826	.	-	.	gene_id "ENSG00000235265.1"; transcript_id "ENST00000413156.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_na

In [19]:
answer2 = '''4459 gencode.v19.annotation.chr22.transcript.gtf
--- First 10 lines ---
chr22	HAVANA	transcript	16062157	16063236	.	+	.	gene_id "ENSG00000233866.1"; transcript_id "ENST00000424770.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "LA16c-4G1.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.3-001"; level 2; tag "basic"; havana_gene "OTTHUMG00000140195.1"; havana_transcript "OTTHUMT00000276574.1";
chr22	HAVANA	transcript	16076052	16076172	.	-	.	gene_id "ENSG00000229286.1"; transcript_id "ENST00000448070.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-4G1.4"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.4-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140193.1"; havana_transcript "OTTHUMT00000276571.1";
chr22	HAVANA	transcript	16084249	16084826	.	-	.	gene_id "ENSG00000235265.1"; transcript_id "ENST00000413156.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-4G1.5"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.5-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140197.1"; havana_transcript "OTTHUMT00000276576.1";
chr22	HAVANA	transcript	16100517	16124973	.	-	.	gene_id "ENSG00000223875.1"; transcript_id "ENST00000420638.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "NBEAP3"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "NBEAP3-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140196.1"; havana_transcript "OTTHUMT00000276575.1";
chr22	HAVANA	transcript	16122720	16123768	.	+	.	gene_id "ENSG00000215270.3"; transcript_id "ENST00000398242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-60H5.7"; transcript_type "processed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-60H5.7-001"; level 1; ont "PGO:0000004"; tag "pseudo_consens"; havana_gene "OTTHUMG00000140200.1"; havana_transcript "OTTHUMT00000276581.1";
chr22	HAVANA	transcript	16147979	16192971	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000447898.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "AP000525.9-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276780.1";
chr22	HAVANA	transcript	16150255	16193000	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000437781.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "AP000525.9-003"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276579.1";
chr22	HAVANA	transcript	16150529	16193004	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000413768.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "AP000525.9-001"; level 2; tag "non_canonical_polymorphism"; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276578.1";
chr22	HAVANA	transcript	16158798	16192995	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000383038.3"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "AP000525.9-004"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276577.1";
chr22	HAVANA	transcript	16158829	16159470	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000607933.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "AP000525.9-006"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000472240.1";
--- Random 10 lines ---
chr22	HAVANA	transcript	16084249	16084826	.	-	.	gene_id "ENSG00000235265.1"; transcript_id "ENST00000413156.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-4G1.5"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.5-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140197.1"; havana_transcript "OTTHUMT00000276576.1";
chr22	HAVANA	transcript	16100517	16124973	.	-	.	gene_id "ENSG00000223875.1"; transcript_id "ENST00000420638.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "NBEAP3"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "NBEAP3-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140196.1"; havana_transcript "OTTHUMT00000276575.1";
chr22	HAVANA	transcript	16122720	16123768	.	+	.	gene_id "ENSG00000215270.3"; transcript_id "ENST00000398242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-60H5.7"; transcript_type "processed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-60H5.7-001"; level 1; ont "PGO:0000004"; tag "pseudo_consens"; havana_gene "OTTHUMG00000140200.1"; havana_transcript "OTTHUMT00000276581.1";
chr22	HAVANA	transcript	16150255	16193000	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000437781.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "AP000525.9-003"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276579.1";
chr22	HAVANA	transcript	16150529	16193004	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000413768.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "AP000525.9-001"; level 2; tag "non_canonical_polymorphism"; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276578.1";
chr22	HAVANA	transcript	16158798	16192995	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000383038.3"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "AP000525.9-004"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276577.1";
chr22	HAVANA	transcript	16158829	16159470	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000607933.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "AP000525.9-006"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000472240.1";
chr22	HAVANA	transcript	16150776	16151397	.	-	.	gene_id "ENSG00000271672.1"; transcript_id "ENST00000456786.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DUXAP8"; transcript_type "transcribed_processed_pseudogene"; transcript_status "KNOWN"; transcript_name "DUXAP8-001"; level 1; ont "PGO:0000004"; ont "PGO:0000019"; tag "pseudo_consens"; havana_gene "OTTHUMG00000140194.3"; havana_transcript "OTTHUMT00000276572.3";
chr22	HAVANA	transcript	16162066	16172700	.	+	.	gene_id "ENSG00000232775.2"; transcript_id "ENST00000440946.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "AP000525.10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "AP000525.10-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000140198.3"; havana_transcript "OTTHUMT00000276785.1";
chr22	HAVANA	transcript	16255355	16256477	.	-	.	gene_id "ENSG00000241838.2"; transcript_id "ENST00000417657.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-3G11.7"; transcript_type "processed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-3G11.7-001"; level 1; ont "PGO:0000004"; tag "pseudo_consens"; havana_gene "OTTHUMG00000140315.1"; havana_transcript "OTTHUMT00000276919.1";
--- Last 10 lines ---
chr22	HAVANA	transcript	51205934	51222090	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000395591.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "protein_coding"; transcript_status "NOVEL"; transcript_name "RABL2B-003"; level 2; tag "basic"; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316608.1";
chr22	HAVANA	transcript	51205934	51222091	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000395595.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RABL2B-006"; level 2; tag "basic"; tag "appris_candidate_longest"; tag "CCDS"; ccdsid "CCDS33683.1"; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316611.1";
chr22	HAVANA	transcript	51205937	51208930	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000465063.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "RABL2B-010"; level 2; tag "basic"; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316615.1";
chr22	HAVANA	transcript	51205958	51222066	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000436958.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "RABL2B-002"; level 2; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316607.1";
chr22	HAVANA	transcript	51207955	51214261	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000482308.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "RABL2B-013"; level 2; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000349056.1";
chr22	HAVANA	transcript	51208210	51221714	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000464678.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "RABL2B-012"; level 2; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000349055.1";
chr22	HAVANA	transcript	51209638	51222058	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000395590.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "RABL2B-004"; level 2; tag "basic"; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316609.1";
chr22	HAVANA	transcript	51214199	51222028	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000468451.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "RABL2B-014"; level 2; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000349057.1";
chr22	HAVANA	transcript	51216088	51222058	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000464740.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "RABL2B-015"; level 2; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000349058.1";
chr22	HAVANA	transcript	51220662	51221473	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000413505.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RABL2B-011"; level 2; tag "alternative_5_UTR"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316616.2";'''

# Remove whitespace at the beginning and end of the exercise 2 result
exercise2 = exercise2.strip()
assert exercise2 == answer2

## Use Bedtools to get promoters and overlap with TF binding sites

[Bedtools](http://bedtools.readthedocs.org/en/latest/) is a "swiss-army knife of tools for a wide-range of genomics analysis tasks". In particular, `bedtools` ***excels*** at "genome algebra": adding and subtracting genomic regions. It is used ***EXTENSIVELY*** in bioinformatics, and few projects can live without it (or without writing their own personal version, but who wants to do that?).
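
To make "genome algebra" concrete, here is a minimal Python sketch of "adding" (merging) and "subtracting" regions on a single chromosome. It is not how `bedtools` is implemented; the function names `merge_regions` and `subtract_regions` and all coordinates are made up for illustration, and regions are 0-based, half-open `(start, end)` pairs as in BED.

```python
def merge_regions(regions):
    """'Add' regions together: combine overlapping or touching intervals."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous region, so extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_regions(region, to_remove):
    """'Subtract' regions: keep the parts of `region` not covered by `to_remove`."""
    start, end = region
    kept = []
    for r_start, r_end in merge_regions(to_remove):
        if r_end <= start or r_start >= end:
            continue  # no overlap with what is left of the region
        if r_start > start:
            kept.append((start, r_start))
        start = max(start, r_end)
    if start < end:
        kept.append((start, end))
    return kept


print(merge_regions([(100, 200), (150, 300), (500, 600)]))
print(subtract_regions((100, 600), [(150, 300), (500, 550)]))
```

The first call collapses the two overlapping regions into one; the second leaves three pieces of the original region behind.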

### Load `biotools` on TSCC which includes Bedtools

First, try running `bedtools`. You should see this:

```
$ bedtools
bash: bedtools: command not found
```

This is because your shell doesn't know anything about the command `bedtools`. To make `bedtools` and other bioinformatics tools available in your TSCC session, run:

```
module load biotools
```

This command has no output. To make sure we have `bedtools` available, use `which` to see the full path to `bedtools`:

```
$ which bedtools
/opt/biotools/bedtools/bin/bedtools
```

(The `biotools` module is a TSCC-specific thing that the nice system administrators have set up for us; it is not something you can do on every server. Nice sysadmins on other clusters may offer something similar, but it is not guaranteed.)
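
The `which` check above can also be run from a Python cell. This is a small sketch using the standard library's `shutil.which`, which searches `PATH` for an executable; the helper name `tool_path` is made up here.

```python
import shutil


def tool_path(name):
    """Return the full path of an executable on PATH, or None if it is not found."""
    return shutil.which(name)


path = tool_path("bedtools")
if path is None:
    print("bedtools not found; has `module load biotools` been run?")
else:
    print("bedtools found at", path)
```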

To see all available modules to load, do `module avail`:

```
[obotvinnik@tscc-login2 ~]$ module avail

-------------------------- /opt/modulefiles/applications/.gnu ---------------------------
atlas/3.10.2(default)     hdf5/1.8.14(default)      scalapack/2.0.2(default)
boost/1.55.0(default)     lapack/3.5.0(default)     slepc/3.5.3(default)
fftw/2.1.5                netcdf/3.6.2              sprng/2.0b(default)
fftw/3.3.4(default)       netcdf/4.3.2(default)     sundials/2.5.0(default)
gsl/1.16(default)         parmetis/4.0.3(default)   superlu/3.3(default)
hdf4/2.10(default)        petsc/3.5.2(default)      trilinos/11.12.1(default)

------------------------------- /opt/modulefiles/mpi/.gnu -------------------------------
mvapich2_ib/2.1rc2(default) openmpi_ib/1.8.4(default)

------------------------- /opt/modulefiles/applications/.intel --------------------------
atlas/3.10.2(default)     lapack/3.5.0(default)     scalapack/2.0.2(default)
boost/1.55.0(default)     mxml/2.9(default)         slepc/3.5.3(default)
fftw/2.1.5                netcdf/3.6.2              sprng/2.0b(default)
fftw/3.3.4(default)       netcdf/4.3.2(default)     sundials/2.5.0(default)
gsl/1.16(default)         papi/5.4.1(default)       superlu/3.3(default)
hdf4/2.10(default)        parmetis/4.0.3(default)   tau/2.23(default)
hdf5/1.8.14(default)      pdt/3.20(default)         trilinos/11.12.1(default)
ipm/2.0.3(default)        petsc/3.5.2(default)

------------------------------ /opt/modulefiles/mpi/.intel ------------------------------
mvapich2_ib/2.1rc2(default) openmpi_ib/1.8.4(default)

---------------------------- /usr/share/Modules/modulefiles -----------------------------
dot              module-info      null             rocks-openmpi_ib
module-git       modules          rocks-openmpi    use.own

----------------------------------- /etc/modulefiles ------------------------------------
openmpi-x86_64

------------------------------ /opt/modulefiles/compilers -------------------------------
cilk/5.4.6(default)           intel/2015.2.164
cmake/3.2.1(default)          mono/3.12.0(default)
gnu/4.9.2(default)            pgi/14.9(default)
guile/2.0.11(default)         python/1(default)
intel/2013_sp1.2.144(default) upc/2.20.0(default)

----------------------------- /opt/modulefiles/applications -----------------------------
abyss/1.5.2(default)        fsa/1.15.9(default)         mpi4py/1.3.1(default)
amber/14(default)           gamess/2014.12(default)     namd/2.10(default)
apbs/1.3(default)           gaussian/09.D.01(default)   namd/2.9
bbcp/14.09.02.00.0(default) globus/5.2.5                nwchem/6.5(default)
bbftp/3.2.1(default)        gmp/6.0.0a(default)         octave/3.8.2(default)
beagle/2.1(default)         gnutools/2.69(default)      polymake/2.13.1(default)
beast/1.8.0                 gromacs/5.0.4(default)      R/3.2.1(default)
beast/1.8.1(default)        idl/8.4(default)            rapidminer/6.1.0(default)
beast2/2.1.3(default)       jags/3.4.0(default)         scipy/2.7(default)
bioroll/6.2(default)        lammps/20141209(default)    siesta/3.2.5(default)
biotools/1(default)         matlab/2013a                stata/13.1(default)
blcr/0.8.5(default)         matlab/2013b                vasp/4.6
cp2k/2.5.1(default)         matlab/2014a                vasp/5.2.12
cpmd/3.17.1(default)        matlab/2014b(default)       vasp/5.2.12.gamma
cuda/6.5.19(default)        mkl/11.1.2.144(default)     vasp/5.3.5(default)
ddt/4.2.2(default)          mpc/1.0.3(default)          vtk/6.1.0(default)
eigen/3.2.3(default)        mpfr/3.1.2(default)         weka/3.7.12(default)
```

To see everything that the `biotools` module provides, look in `/opt/biotools/`:

```
$ ls /opt/biotools/
bamtools   blat       cufflinks  GenomeAnalysisTK  miRDeep2  randfold    spades       trinity
bedtools   bowtie     dendropy   gmap_gsnap        miso      rseqc       squid        velvet
biopython  bowtie2    edena      htseq             picard    samtools    stacks       ViennaRNA
bismark    bwa        fastqc     idba-ud           plink     soapdenovo  tophat
blast      bx-python  fastx      matt              pysam     SOAPsnp     trimmomatic
```

### Exercise 3: Use `bedtools flank` to get the promoters

Get the 2000 bp upstream of each transcript. The genome file we are using is "`hg19.genome`", provided in the data directory. Save the output as `gencode.v19.annotation.chr22.transcript.promoter.gtf`.

Go to the documentation for [bedtools](http://bedtools.readthedocs.org/en/latest/) and find the entry for "`bedtools flank`".
The output of `bedtools flank -h` is also informative (shown below).

``` 
$ bedtools flank -h

Tool:    bedtools flank (aka flankBed)
Version: v2.22.1
Summary: Creates flanking interval(s) for each BED/GFF/VCF feature.

Usage:   bedtools flank [OPTIONS] -i <bed/gff/vcf> -g <genome> [-b <int> or (-l and -r)]

Options: 
        -b      Create flanking interval(s) using -b base pairs in each direction.
                - (Integer) or (Float, e.g. 0.1) if used with -pct.

        -l      The number of base pairs that a flank should start from
                orig. start coordinate.
                - (Integer) or (Float, e.g. 0.1) if used with -pct.

        -r      The number of base pairs that a flank should end from
                orig. end coordinate.
                - (Integer) or (Float, e.g. 0.1) if used with -pct.

        -s      Define -l and -r based on strand.
                E.g. if used, -l 500 for a negative-stranded feature, 
                it will start the flank 500 bp downstream.  Default = false.

        -pct    Define -l and -r as a fraction of the feature's length.
                E.g. if used on a 1000bp feature, -l 0.50, 
                will add 500 bp "upstream".  Default = false.

        -header Print the header from the input file prior to results.

Notes: 
        (1)  Starts will be set to 0 if options would force it below 0.
        (2)  Ends will be set to the chromosome length if requested flank would
        force it above the max chrom length.
        (3)  In contrast to slop, which _extends_ intervals, bedtools flank
        creates new intervals from the regions just up- and down-stream
        of your existing intervals.
        (4)  The genome file should tab delimited and structured as follows:

        <chromName><TAB><chromSize>

        For example, Human (hg19):
        chr1    249250621
        chr2    243199373
        ...
        chr18_gl000207_random   4262

Tips: 
        One can use the UCSC Genome Browser's MySQL database to extract
        chromosome sizes. For example, H. sapiens:

        mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
        "select chrom, size from hg19.chromInfo"  > hg19.genome
```
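
As a rough illustration of what a strand-aware upstream flank means for GTF coordinates (a sketch of the arithmetic, not the `bedtools` source), the calculation can be written in a few lines of Python. The function name `upstream_flank` is hypothetical, coordinates are treated as 1-based and inclusive as in GTF, and the chr22 length is assumed to be the hg19 value that `hg19.genome` would supply.

```python
CHR22_LENGTH = 51304566  # assumed hg19 chr22 size; check hg19.genome


def upstream_flank(start, end, strand, size=2000, chrom_length=CHR22_LENGTH):
    """Return the (start, end) of the region `size` bp upstream of a feature.

    Coordinates are 1-based and inclusive, as in GTF.
    """
    if strand == "+":
        # Upstream of a plus-strand feature lies at smaller coordinates
        flank_start, flank_end = start - size, start - 1
    else:
        # Upstream of a minus-strand feature lies at larger coordinates
        flank_start, flank_end = end + 1, end + size
    # Clip to the chromosome, in the spirit of Notes (1) and (2) in the help text
    return max(flank_start, 1), min(flank_end, chrom_length)


# Two transcripts from the chr22 transcript GTF
print(upstream_flank(16162066, 16172700, "+"))  # AP000525.10-002
print(upstream_flank(16255355, 16256477, "-"))  # LA16c-3G11.7-001
```

These print `(16160066, 16162065)` and `(16256478, 16258477)`, which are the promoter coordinates listed for the same two transcripts in the graded answer for this exercise.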


In [20]:
%%bash --out exercise3
### BEGIN SOLUTION
bedtools flank -l 2000 -r 0 -s -i gencode.v19.annotation.chr22.transcript.gtf -g hg19.genome \
    > gencode.v19.annotation.chr22.transcript.promoter.gtf
### END SOLUTION

wc -l gencode.v19.annotation.chr22.transcript.promoter.gtf
echo '--- First 10 lines ---'
head gencode.v19.annotation.chr22.transcript.promoter.gtf
echo '--- Random 10 lines ---'
awk -v seed=907 'BEGIN{srand(seed);}{ if (rand() < 0.5 ) {print $0}}' gencode.v19.annotation.chr22.transcript.promoter.gtf | head
echo '--- Last 10 lines ---'
tail gencode.v19.annotation.chr22.transcript.promoter.gtf

The following cell is for your debugging purposes. Due to the way we have to set up the notebook to grade it, we add `--out exercise3` next to the `%%bash` magic to save the output of the previous cell to the variable `exercise3`, so here we're printing the output. This is what *would* be shown below the cell if we weren't doing this workaround. This is what is graded.

In [21]:
print(exercise3)

    4459 gencode.v19.annotation.chr22.transcript.promoter.gtf
--- First 10 lines ---
chr22	HAVANA	transcript	16060157	16062156	.	+	.	gene_id "ENSG00000233866.1"; transcript_id "ENST00000424770.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "LA16c-4G1.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.3-001"; level 2; tag "basic"; havana_gene "OTTHUMG00000140195.1"; havana_transcript "OTTHUMT00000276574.1";
chr22	HAVANA	transcript	16076173	16078172	.	-	.	gene_id "ENSG00000229286.1"; transcript_id "ENST00000448070.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-4G1.4"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.4-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140193.1"; havana_transcript "OTTHUMT00000276571.1";
chr22	HAVANA	transcript	16084827	16086826	.	-	.	gene_id "ENSG00000235265.1"; transcript_id "ENST00000413156.1"; gene_type "pseudogene"; gene_status "KNOWN"

In [22]:
answer3 = '''4459 gencode.v19.annotation.chr22.transcript.promoter.gtf
--- First 10 lines ---
chr22	HAVANA	transcript	16060157	16062156	.	+	.	gene_id "ENSG00000233866.1"; transcript_id "ENST00000424770.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "LA16c-4G1.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.3-001"; level 2; tag "basic"; havana_gene "OTTHUMG00000140195.1"; havana_transcript "OTTHUMT00000276574.1";
chr22	HAVANA	transcript	16076173	16078172	.	-	.	gene_id "ENSG00000229286.1"; transcript_id "ENST00000448070.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-4G1.4"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.4-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140193.1"; havana_transcript "OTTHUMT00000276571.1";
chr22	HAVANA	transcript	16084827	16086826	.	-	.	gene_id "ENSG00000235265.1"; transcript_id "ENST00000413156.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-4G1.5"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.5-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140197.1"; havana_transcript "OTTHUMT00000276576.1";
chr22	HAVANA	transcript	16124974	16126973	.	-	.	gene_id "ENSG00000223875.1"; transcript_id "ENST00000420638.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "NBEAP3"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "NBEAP3-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140196.1"; havana_transcript "OTTHUMT00000276575.1";
chr22	HAVANA	transcript	16120720	16122719	.	+	.	gene_id "ENSG00000215270.3"; transcript_id "ENST00000398242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-60H5.7"; transcript_type "processed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-60H5.7-001"; level 1; ont "PGO:0000004"; tag "pseudo_consens"; havana_gene "OTTHUMG00000140200.1"; havana_transcript "OTTHUMT00000276581.1";
chr22	HAVANA	transcript	16192972	16194971	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000447898.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "AP000525.9-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276780.1";
chr22	HAVANA	transcript	16193001	16195000	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000437781.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "AP000525.9-003"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276579.1";
chr22	HAVANA	transcript	16193005	16195004	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000413768.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "AP000525.9-001"; level 2; tag "non_canonical_polymorphism"; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276578.1";
chr22	HAVANA	transcript	16192996	16194995	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000383038.3"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "AP000525.9-004"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276577.1";
chr22	HAVANA	transcript	16159471	16161470	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000607933.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "AP000525.9-006"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000472240.1";
--- Random 10 lines ---
chr22	HAVANA	transcript	16084827	16086826	.	-	.	gene_id "ENSG00000235265.1"; transcript_id "ENST00000413156.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-4G1.5"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-4G1.5-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140197.1"; havana_transcript "OTTHUMT00000276576.1";
chr22	HAVANA	transcript	16124974	16126973	.	-	.	gene_id "ENSG00000223875.1"; transcript_id "ENST00000420638.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "NBEAP3"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "NBEAP3-001"; level 2; ont "PGO:0000005"; havana_gene "OTTHUMG00000140196.1"; havana_transcript "OTTHUMT00000276575.1";
chr22	HAVANA	transcript	16120720	16122719	.	+	.	gene_id "ENSG00000215270.3"; transcript_id "ENST00000398242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-60H5.7"; transcript_type "processed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-60H5.7-001"; level 1; ont "PGO:0000004"; tag "pseudo_consens"; havana_gene "OTTHUMG00000140200.1"; havana_transcript "OTTHUMT00000276581.1";
chr22	HAVANA	transcript	16193001	16195000	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000437781.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "AP000525.9-003"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276579.1";
chr22	HAVANA	transcript	16193005	16195004	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000413768.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "AP000525.9-001"; level 2; tag "non_canonical_polymorphism"; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276578.1";
chr22	HAVANA	transcript	16192996	16194995	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000383038.3"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "AP000525.9-004"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000276577.1";
chr22	HAVANA	transcript	16159471	16161470	.	-	.	gene_id "ENSG00000206195.6"; transcript_id "ENST00000607933.1"; gene_type "processed_transcript"; gene_status "NOVEL"; gene_name "AP000525.9"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "AP000525.9-006"; level 2; havana_gene "OTTHUMG00000185288.2"; havana_transcript "OTTHUMT00000472240.1";
chr22	HAVANA	transcript	16151398	16153397	.	-	.	gene_id "ENSG00000271672.1"; transcript_id "ENST00000456786.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DUXAP8"; transcript_type "transcribed_processed_pseudogene"; transcript_status "KNOWN"; transcript_name "DUXAP8-001"; level 1; ont "PGO:0000004"; ont "PGO:0000019"; tag "pseudo_consens"; havana_gene "OTTHUMG00000140194.3"; havana_transcript "OTTHUMT00000276572.3";
chr22	HAVANA	transcript	16160066	16162065	.	+	.	gene_id "ENSG00000232775.2"; transcript_id "ENST00000440946.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "AP000525.10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "AP000525.10-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000140198.3"; havana_transcript "OTTHUMT00000276785.1";
chr22	HAVANA	transcript	16256478	16258477	.	-	.	gene_id "ENSG00000241838.2"; transcript_id "ENST00000417657.1"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "LA16c-3G11.7"; transcript_type "processed_pseudogene"; transcript_status "KNOWN"; transcript_name "LA16c-3G11.7-001"; level 1; ont "PGO:0000004"; tag "pseudo_consens"; havana_gene "OTTHUMG00000140315.1"; havana_transcript "OTTHUMT00000276919.1";
--- Last 10 lines ---
chr22	HAVANA	transcript	51222091	51224090	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000395591.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "protein_coding"; transcript_status "NOVEL"; transcript_name "RABL2B-003"; level 2; tag "basic"; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316608.1";
chr22	HAVANA	transcript	51222092	51224091	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000395595.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RABL2B-006"; level 2; tag "basic"; tag "appris_candidate_longest"; tag "CCDS"; ccdsid "CCDS33683.1"; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316611.1";
chr22	HAVANA	transcript	51208931	51210930	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000465063.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "RABL2B-010"; level 2; tag "basic"; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316615.1";
chr22	HAVANA	transcript	51222067	51224066	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000436958.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "RABL2B-002"; level 2; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316607.1";
chr22	HAVANA	transcript	51214262	51216261	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000482308.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "RABL2B-013"; level 2; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000349056.1";
chr22	HAVANA	transcript	51221715	51223714	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000464678.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "RABL2B-012"; level 2; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000349055.1";
chr22	HAVANA	transcript	51222059	51224058	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000395590.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "RABL2B-004"; level 2; tag "basic"; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316609.1";
chr22	HAVANA	transcript	51222029	51224028	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000468451.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "RABL2B-014"; level 2; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000349057.1";
chr22	HAVANA	transcript	51222059	51224058	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000464740.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "RABL2B-015"; level 2; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000349058.1";
chr22	HAVANA	transcript	51221474	51223473	.	-	.	gene_id "ENSG00000079974.13"; transcript_id "ENST00000413505.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RABL2B"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RABL2B-011"; level 2; tag "alternative_5_UTR"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000150156.3"; havana_transcript "OTTHUMT00000316616.2";'''

# Remove whitespace at the beginning and end of the exercise 3 result
exercise3 = exercise3.strip()
assert exercise3 == answer3
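
If an assertion like the one above fails, printing both long strings is rarely the quickest way to find the problem. One option is to print only the lines that differ. This is a debugging sketch using Python's standard `difflib` module; the helper name `show_diff` is made up, and the strings below are toy stand-ins for `exercise3` and `answer3`.

```python
import difflib


def show_diff(result, expected, max_lines=20):
    """Return up to `max_lines` lines of a unified diff between two multi-line strings."""
    diff = difflib.unified_diff(
        expected.splitlines(),
        result.splitlines(),
        fromfile="expected",
        tofile="result",
        lineterm="",
        n=0,  # no context lines, because GTF lines are very long
    )
    return "\n".join(list(diff)[:max_lines])


# Toy strings standing in for `exercise3` and `answer3`
print(show_diff("4459 file.gtf\nchr22\t100\t200", "4459 file.gtf\nchr22\t100\t300"))
```

In the notebook this would be called as `print(show_diff(exercise3, answer3))`; lines starting with `-` are expected but missing, and lines starting with `+` are present but not expected.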

### Exercise 4: Use `bedtools intersect` to overlap TFs with promoters

Take a look at the diagrams in the documentation for `bedtools intersect` and think about how it can answer these questions:

* Given locations of genome methylation, which genes do they overlap?
* Given locations of RBP binding, which exons do they overlap?
* Given two ChIP-seq experiments, which peaks are consistent between them?

Use `bedtools intersect` to find which promoters overlap with the NFKB binding sites. Use the promoters as "A" and the binding sites as "B". Call this file `gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf`.

This file should have 1221 lines.
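
With no extra options, `bedtools intersect` reports, for every overlap between an "A" feature and a "B" feature, the part of the A feature that overlaps. The following Python sketch mimics that behaviour on 0-based, half-open intervals. It is not the `bedtools` implementation; the function name `intersect` and all coordinates are invented for illustration.

```python
def intersect(a_regions, b_regions):
    """For each A region, return the portion shared with each overlapping B region."""
    overlaps = []
    for a_start, a_end in a_regions:
        for b_start, b_end in b_regions:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:  # the two regions share at least one base
                overlaps.append((start, end))
    return overlaps


promoters = [(1000, 3000), (5000, 7000)]  # "A"
binding_sites = [(900, 1200), (2500, 2600), (8000, 8100)]  # "B"
print(intersect(promoters, binding_sites))
```

The first promoter overlaps two binding sites and so yields two output regions, each narrower than the promoter itself, while the second promoter yields none. This is why the same transcript can appear on more than one line of the output, with coordinates shorter than 2000 bp.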

In [23]:
%%bash --out exercise4
### BEGIN SOLUTION
bedtools intersect -a gencode.v19.annotation.chr22.transcript.promoter.gtf -b tf.nfkb.bed \
    > gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf
### END SOLUTION

wc -l gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf
echo '--- First 10 lines ---'
head gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf
echo '--- Random 10 lines ---'
awk -v seed=908 'BEGIN{srand(seed);}{ if (rand() < 0.5 ) {print $0}}' gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf | head
echo '--- Last 10 lines ---'
tail gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf

The following cell is for your debugging purposes. Due to the way we have to set up the notebook to grade it, we add `--out exercise4` next to the `%%bash` magic to save the output of the previous cell to the variable `exercise4`, so here we're printing the output. This is what *would* be shown below the cell if we weren't doing this workaround. This is what is graded.

In [24]:
print(exercise4)

    1221 gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf
--- First 10 lines ---
chr22	HAVANA	transcript	17565763	17565843	.	+	.	gene_id "ENSG00000177663.9"; transcript_id "ENST00000477874.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IL17RA"; transcript_type "processed_transcript"; transcript_status "PUTATIVE"; transcript_name "IL17RA-004"; level 1; tag "basic"; tag "exp_conf"; havana_gene "OTTHUMG00000150026.1"; havana_transcript "OTTHUMT00000315823.1";
chr22	HAVANA	transcript	17565763	17565848	.	+	.	gene_id "ENSG00000177663.9"; transcript_id "ENST00000319363.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IL17RA"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IL17RA-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS13739.1"; havana_gene "OTTHUMG00000150026.1"; havana_transcript "OTTHUMT00000315820.1";
chr22	HAVANA	transcript	17565763	17565946	.	+	.	gene_id "ENSG00000177663.9"; transc

In [25]:
answer4 = '''1221 gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf
--- First 10 lines ---
chr22	HAVANA	transcript	17565763	17565843	.	+	.	gene_id "ENSG00000177663.9"; transcript_id "ENST00000477874.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IL17RA"; transcript_type "processed_transcript"; transcript_status "PUTATIVE"; transcript_name "IL17RA-004"; level 1; tag "basic"; tag "exp_conf"; havana_gene "OTTHUMG00000150026.1"; havana_transcript "OTTHUMT00000315823.1";
chr22	HAVANA	transcript	17565763	17565848	.	+	.	gene_id "ENSG00000177663.9"; transcript_id "ENST00000319363.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IL17RA"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IL17RA-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS13739.1"; havana_gene "OTTHUMG00000150026.1"; havana_transcript "OTTHUMT00000315820.1";
chr22	HAVANA	transcript	17565763	17565946	.	+	.	gene_id "ENSG00000177663.9"; transcript_id "ENST00000459971.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IL17RA"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "IL17RA-003"; level 2; havana_gene "OTTHUMG00000150026.1"; havana_transcript "OTTHUMT00000315822.1";
chr22	HAVANA	transcript	17700267	17700502	.	-	.	gene_id "ENSG00000093072.11"; transcript_id "ENST00000399837.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CECR1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "CECR1-201"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS13742.1"; havana_gene "OTTHUMG00000030726.8"; havana_transcript "OTTHUMT00000316079.1";
chr22	HAVANA	transcript	17701962	17702266	.	-	.	gene_id "ENSG00000093072.11"; transcript_id "ENST00000399837.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CECR1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "CECR1-201"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS13742.1"; havana_gene "OTTHUMG00000030726.8"; havana_transcript "OTTHUMT00000316079.1";
chr22	HAVANA	transcript	17700326	17700502	.	-	.	gene_id "ENSG00000093072.11"; transcript_id "ENST00000543038.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CECR1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "CECR1-001"; level 2; tag "alternative_5_UTR"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000030726.8"; havana_transcript "OTTHUMT00000075614.5";
chr22	HAVANA	transcript	17701962	17702278	.	-	.	gene_id "ENSG00000093072.11"; transcript_id "ENST00000543038.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CECR1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "CECR1-001"; level 2; tag "alternative_5_UTR"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000030726.8"; havana_transcript "OTTHUMT00000075614.5";
chr22	HAVANA	transcript	18111762	18112038	.	-	.	gene_id "ENSG00000131100.8"; transcript_id "ENST00000253413.5"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ATP6V1E1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "ATP6V1E1-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS13745.1"; havana_gene "OTTHUMG00000059320.4"; havana_transcript "OTTHUMT00000131790.3";
chr22	HAVANA	transcript	18111762	18112038	.	-	.	gene_id "ENSG00000131100.8"; transcript_id "ENST00000399796.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ATP6V1E1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "ATP6V1E1-006"; level 2; tag "basic"; tag "CCDS"; ccdsid "CCDS42978.1"; havana_gene "OTTHUMG00000059320.4"; havana_transcript "OTTHUMT00000353464.1";
chr22	HAVANA	transcript	18111762	18112038	.	-	.	gene_id "ENSG00000131100.8"; transcript_id "ENST00000399798.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ATP6V1E1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "ATP6V1E1-005"; level 2; tag "basic"; tag "CCDS"; ccdsid "CCDS42977.1"; havana_gene "OTTHUMG00000059320.4"; havana_transcript "OTTHUMT00000353463.1";
--- Random 10 lines ---
chr22	HAVANA	transcript	17565763	17565848	.	+	.	gene_id "ENSG00000177663.9"; transcript_id "ENST00000319363.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IL17RA"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IL17RA-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS13739.1"; havana_gene "OTTHUMG00000150026.1"; havana_transcript "OTTHUMT00000315820.1";
chr22	HAVANA	transcript	17565763	17565946	.	+	.	gene_id "ENSG00000177663.9"; transcript_id "ENST00000459971.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IL17RA"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "IL17RA-003"; level 2; havana_gene "OTTHUMG00000150026.1"; havana_transcript "OTTHUMT00000315822.1";
chr22	HAVANA	transcript	17700267	17700502	.	-	.	gene_id "ENSG00000093072.11"; transcript_id "ENST00000399837.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CECR1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "CECR1-201"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS13742.1"; havana_gene "OTTHUMG00000030726.8"; havana_transcript "OTTHUMT00000316079.1";
chr22	HAVANA	transcript	18111762	18112038	.	-	.	gene_id "ENSG00000131100.8"; transcript_id "ENST00000399798.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ATP6V1E1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "ATP6V1E1-005"; level 2; tag "basic"; tag "CCDS"; ccdsid "CCDS42977.1"; havana_gene "OTTHUMG00000059320.4"; havana_transcript "OTTHUMT00000353463.1";
chr22	HAVANA	transcript	18121203	18121355	.	+	.	gene_id "ENSG00000099968.13"; transcript_id "ENST00000317582.5"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "BCL2L13"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "BCL2L13-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS13746.1"; havana_gene "OTTHUMG00000150088.3"; havana_transcript "OTTHUMT00000316184.1";
chr22	HAVANA	transcript	18121203	18121506	.	+	.	gene_id "ENSG00000099968.13"; transcript_id "ENST00000498133.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "BCL2L13"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "BCL2L13-003"; level 2; havana_gene "OTTHUMG00000150088.3"; havana_transcript "OTTHUMT00000316186.2";
chr22	ENSEMBL	transcript	18121203	18121522	.	+	.	gene_id "ENSG00000099968.13"; transcript_id "ENST00000538149.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "BCL2L13"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "BCL2L13-203"; level 3; tag "basic"; havana_gene "OTTHUMG00000150088.3";
chr22	ENSEMBL	transcript	18121203	18121529	.	+	.	gene_id "ENSG00000099968.13"; transcript_id "ENST00000337612.5"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "BCL2L13"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "BCL2L13-201"; level 3; tag "basic"; tag "CCDS"; ccdsid "CCDS59448.1"; havana_gene "OTTHUMG00000150088.3";
chr22	HAVANA	transcript	18121203	18121537	.	+	.	gene_id "ENSG00000099968.13"; transcript_id "ENST00000493680.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "BCL2L13"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "BCL2L13-004"; level 2; tag "alternative_5_UTR"; tag "basic"; havana_gene "OTTHUMG00000150088.3"; havana_transcript "OTTHUMT00000316187.2";
chr22	HAVANA	transcript	18225198	18225524	.	-	.	gene_id "ENSG00000015475.14"; transcript_id "ENST00000494097.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "BID"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "BID-005"; level 2; havana_gene "OTTHUMG00000150087.4"; havana_transcript "OTTHUMT00000316182.1";
--- Last 10 lines ---
chr22	HAVANA	transcript	51021250	51021316	.	-	.	gene_id "ENSG00000100288.15"; transcript_id "ENST00000479003.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CHKB"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "CHKB-004"; level 2; havana_gene "OTTHUMG00000150275.5"; havana_transcript "OTTHUMT00000317270.1";
chr22	HAVANA	transcript	51021478	51021754	.	-	.	gene_id "ENSG00000100288.15"; transcript_id "ENST00000479003.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CHKB"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "CHKB-004"; level 2; havana_gene "OTTHUMG00000150275.5"; havana_transcript "OTTHUMT00000317270.1";
chr22	HAVANA	transcript	51021040	51021316	.	-	.	gene_id "ENSG00000100288.15"; transcript_id "ENST00000468532.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CHKB"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "CHKB-007"; level 2; havana_gene "OTTHUMG00000150275.5"; havana_transcript "OTTHUMT00000317603.1";
chr22	HAVANA	transcript	51021478	51021754	.	-	.	gene_id "ENSG00000100288.15"; transcript_id "ENST00000468532.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CHKB"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "CHKB-007"; level 2; havana_gene "OTTHUMG00000150275.5"; havana_transcript "OTTHUMT00000317603.1";
chr22	HAVANA	transcript	51021040	51021209	.	-	.	gene_id "ENSG00000100288.15"; transcript_id "ENST00000489453.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CHKB"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "CHKB-009"; level 2; havana_gene "OTTHUMG00000150275.5"; havana_transcript "OTTHUMT00000317605.1";
chr22	HAVANA	transcript	51021284	51021316	.	-	.	gene_id "ENSG00000100288.15"; transcript_id "ENST00000476289.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CHKB"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "CHKB-006"; level 2; havana_gene "OTTHUMG00000150275.5"; havana_transcript "OTTHUMT00000317602.2";
chr22	HAVANA	transcript	51021478	51021754	.	-	.	gene_id "ENSG00000100288.15"; transcript_id "ENST00000476289.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CHKB"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "CHKB-006"; level 2; havana_gene "OTTHUMG00000150275.5"; havana_transcript "OTTHUMT00000317602.2";
chr22	HAVANA	transcript	51021040	51021316	.	-	.	gene_id "ENSG00000100288.15"; transcript_id "ENST00000465842.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CHKB"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "CHKB-008"; level 2; havana_gene "OTTHUMG00000150275.5"; havana_transcript "OTTHUMT00000317604.1";
chr22	HAVANA	transcript	51021478	51021754	.	-	.	gene_id "ENSG00000100288.15"; transcript_id "ENST00000465842.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "CHKB"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "CHKB-008"; level 2; havana_gene "OTTHUMG00000150275.5"; havana_transcript "OTTHUMT00000317604.1";
chr22	HAVANA	transcript	51021040	51021316	.	+	.	gene_id "ENSG00000205559.3"; transcript_id "ENST00000380711.3"; gene_type "antisense"; gene_status "NOVEL"; gene_name "CHKB-AS1"; transcript_type "antisense"; transcript_status "KNOWN"; transcript_name "CHKB-AS1-001"; level 1; tag "basic"; tag "exp_conf"; havana_gene "OTTHUMG00000150208.1"; havana_transcript "OTTHUMT00000316839.1";'''

# Remove whitespace at the beginning and end of the exercise 4 result
exercise4 = exercise4.strip()
assert exercise4 == answer4

#### Exercise 5: Use `bedtools getfasta` to extract sequences

Read the documentation for [`bedtools getfasta`](http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html) and figure out how to request the sequences in FASTA format for the peaks that overlap gene promoters. Does the strand matter?

This file should have 2442 lines (the `wc -l` count in the graded output).
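
One way to see what the strand changes: the last two records of the expected output share the coordinates `51021039-51021316` but lie on opposite strands. The minimal Python sketch below takes the first bases of the `(-)` record and the last bases of the `(+)` record and checks that one is the reverse complement of the other. `reverse_complement` is a helper defined here for illustration; it is not part of `bedtools`.

```python
# Translation table mapping each DNA base to its complement (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the sequence as read 5' to 3' on the opposite strand."""
    return sequence.translate(COMPLEMENT)[::-1]


# First 16 bases of the (-) record and last 16 bases of the (+) record
# for chr22:51021039-51021316, copied from the expected output.
minus_record_start = "GAAGCCCAGGCCGGCC"
plus_record_end = "GGCCGGCCTGGGCTTC"

print(reverse_complement(minus_record_start) == plus_record_end)
```

The same interval therefore yields two different sequences depending on whether the strand column is taken into account.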

In [26]:
%%bash --out exercise5

### BEGIN SOLUTION
bedtools getfasta -s -bed gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf \
    -fi GRCh37.p13.chr22.fa -fo gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta
### END SOLUTION


wc -l gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta
echo '--- First 10 lines ---'
head gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta
echo '--- Random 10 lines ---'
awk -v seed=908 'BEGIN{srand(seed);}{ if (rand() < 0.5 ) {print $0}}' gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta | head
echo '--- Last 10 lines ---'
tail gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta

The following cell is for your debugging purposes. Because of the way the notebook has to be set up for grading, we add `--out exercise5` next to the `%%bash` magic, which saves the output of the previous cell to the variable `exercise5` instead of displaying it. Printing the variable here shows what *would* have appeared below that cell. This output is what is graded.

In [27]:
print(exercise5)

    2442 gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta
--- First 10 lines ---
>chr22:17565762-17565843(+)
GCAAGGAGGAACGGGGGGAGGGCCCTGTCTCACCCTAAACCCCTCCCCGTCCCAGACTAAACTCCTCCCCTCGGCCCGGCC
>chr22:17565762-17565848(+)
GCAAGGAGGAACGGGGGGAGGGCCCTGTCTCACCCTAAACCCCTCCCCGTCCCAGACTAAACTCCTCCCCTCGGCCCGGCCCGCCC
>chr22:17565762-17565946(+)
GCAAGGAGGAACGGGGGGAGGGCCCTGTCTCACCCTAAACCCCTCCCCGTCCCAGACTAAACTCCTCCCCTCGGCCCGGCCCGCCCCTGGGCCCGGGCTGGAAGCCGGAAGCGAGCAAAGTGGAGCCGACTCGAACTCCACCGCGGAAAAGAAAGCCTCAGAACGTTCGTTCGCTGCGTCCCCA
>chr22:17700266-17700502(-)
TGAACATGCGCACAAGCCCCCCACATCAAACTCAGATGCACACAGCAAGCATGCATGTAAGCCACACACTTGCCACTCTGGCTGCCTCAGCACTGCAGAGCCCCTCTCCCCGGCCTCTAGTTGGATCCCTGGTGCACACTGGAGGGAAATAAAACCTACCCCAACAGGAAGTGAAACAGTTGGTGAGCTTTTCCGGTGCTCTGCACAGATGCTGGGGCGCTGAGCAAACAGCCCTC
>chr22:17701961-17702266(-)
AACCAACATGGCACAGGTATACCTATGTATCAAACCTGCACATTGTGCACATGTACCCTAGAACTTAAAGTATAAAAAAAACCCACAAAAAACCCTTCACATGATTTACTTTCAGAATTGGTGGTTTCCCTTTGTGCGGCGCTGGAATCAATCTTGTTTCTCCTTATTACTTGCGGT

In [28]:
answer5 = '''2442 gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta
--- First 10 lines ---
>chr22:17565762-17565843(+)
GCAAGGAGGAACGGGGGGAGGGCCCTGTCTCACCCTAAACCCCTCCCCGTCCCAGACTAAACTCCTCCCCTCGGCCCGGCC
>chr22:17565762-17565848(+)
GCAAGGAGGAACGGGGGGAGGGCCCTGTCTCACCCTAAACCCCTCCCCGTCCCAGACTAAACTCCTCCCCTCGGCCCGGCCCGCCC
>chr22:17565762-17565946(+)
GCAAGGAGGAACGGGGGGAGGGCCCTGTCTCACCCTAAACCCCTCCCCGTCCCAGACTAAACTCCTCCCCTCGGCCCGGCCCGCCCCTGGGCCCGGGCTGGAAGCCGGAAGCGAGCAAAGTGGAGCCGACTCGAACTCCACCGCGGAAAAGAAAGCCTCAGAACGTTCGTTCGCTGCGTCCCCA
>chr22:17700266-17700502(-)
TGAACATGCGCACAAGCCCCCCACATCAAACTCAGATGCACACAGCAAGCATGCATGTAAGCCACACACTTGCCACTCTGGCTGCCTCAGCACTGCAGAGCCCCTCTCCCCGGCCTCTAGTTGGATCCCTGGTGCACACTGGAGGGAAATAAAACCTACCCCAACAGGAAGTGAAACAGTTGGTGAGCTTTTCCGGTGCTCTGCACAGATGCTGGGGCGCTGAGCAAACAGCCCTC
>chr22:17701961-17702266(-)
AACCAACATGGCACAGGTATACCTATGTATCAAACCTGCACATTGTGCACATGTACCCTAGAACTTAAAGTATAAAAAAAACCCACAAAAAACCCTTCACATGATTTACTTTCAGAATTGGTGGTTTCCCTTTGTGCGGCGCTGGAATCAATCTTGTTTCTCCTTATTACTTGCGGTGCATTCTGCTTCCTCTAACTTTCAAAAAATTAGTGTTAAACTCTTTTTTTTTTTTTTGAGACAAAATCTCACTCTGTCATCCAGACTGGAGTGCAATGGTGCAATCTCAGCTCACTGCAACCTCCACC
--- Random 10 lines ---
GCAAGGAGGAACGGGGGGAGGGCCCTGTCTCACCCTAAACCCCTCCCCGTCCCAGACTAAACTCCTCCCCTCGGCCCGGCC
>chr22:17565762-17565848(+)
GCAAGGAGGAACGGGGGGAGGGCCCTGTCTCACCCTAAACCCCTCCCCGTCCCAGACTAAACTCCTCCCCTCGGCCCGGCCCGCCC
AACCAACATGGCACAGGTATACCTATGTATCAAACCTGCACATTGTGCACATGTACCCTAGAACTTAAAGTATAAAAAAAACCCACAAAAAACCCTTCACATGATTTACTTTCAGAATTGGTGGTTTCCCTTTGTGCGGCGCTGGAATCAATCTTGTTTCTCCTTATTACTTGCGGTGCATTCTGCTTCCTCTAACTTTCAAAAAATTAGTGTTAAACTCTTTTTTTTTTTTTTGAGACAAAATCTCACTCTGTCATCCAGACTGGAGTGCAATGGTGCAATCTCAGCTCACTGCAACCTCCACC
ATGGGTGCAGCAAACCAACATGGCACAGGTATACCTATGTATCAAACCTGCACATTGTGCACATGTACCCTAGAACTTAAAGTATAAAAAAAACCCACAAAAAACCCTTCACATGATTTACTTTCAGAATTGGTGGTTTCCCTTTGTGCGGCGCTGGAATCAATCTTGTTTCTCCTTATTACTTGCGGTGCATTCTGCTTCCTCTAACTTTCAAAAAATTAGTGTTAAACTCTTTTTTTTTTTTTTGAGACAAAATCTCACTCTGTCATCCAGACTGGAGTGCAATGGTGCAATCTCAGCTCACTGCAACCTCCACC
>chr22:18111761-18112038(-)
ACACTTTATTCAGGTAATCAAGGTTGACATGAATAGTCAGAAATCATGTATCCCCCTTTTTTTTTTCTTTTGGTAGAGACGGGGTTCCTCTATGTTGTCCAGCCTGGTCTCGAACTTGAGGTCAAGCGATACACCCGCCTTGGTCTCCCAAAGTCTCCGGATTGCAGGCGTGAGCCACCACGCCCGGCCAACAATCATGTACCCCTAAGCCCATAACCCTGTGTAATTATGCGAAAAATTCCAATAAAGGCACATTCTCCAATATACCTAACCAGTA
>chr22:18111761-18112038(-)
ACACTTTATTCAGGTAATCAAGGTTGACATGAATAGTCAGAAATCATGTATCCCCCTTTTTTTTTTCTTTTGGTAGAGACGGGGTTCCTCTATGTTGTCCAGCCTGGTCTCGAACTTGAGGTCAAGCGATACACCCGCCTTGGTCTCCCAAAGTCTCCGGATTGCAGGCGTGAGCCACCACGCCCGGCCAACAATCATGTACCCCTAAGCCCATAACCCTGTGTAATTATGCGAAAAATTCCAATAAAGGCACATTCTCCAATATACCTAACCAGTA
ACACTTTATTCAGGTAATCAAGGTTGACATGAATAGTCAGAAATCATGTATCCCCCTTTTTTTTTTCTTTTGGTAGAGACGGGGTTCCTCTATGTTGTCCAGCCTGGTCTCGAACTTGAGGTCAAGCGATACACCCGCCTTGGTCTCCCAAAGTCTCCGGATTGCAGGCGTGAGCCACCACGCCCGGCCAACAATCATGTACCCCTAAGCCCATAACCCTGTGTAATTATGCGAAAAATTCCAATAAAGGCACATTCTCCAATATACCTAACCAGTA
--- Last 10 lines ---
>chr22:51021283-51021316(-)
GAAGCCCAGGCCGGCCGGAAGAGGAGCCGAGCG
>chr22:51021477-51021754(-)
ACCCACGCTCAGGACGGACGCGCGCTGGACGGCTCTTCCTTGTCGGAGCGCCCCAGGGGTCGGGGAAGAGGGCCCGGCAAGGGAGCCCTCGCGCCGGAGCTGCAGCTGCAGCCGCCGCCCCGCCGCCCCGCCGGCTCCCACGGGGCAGAGACGCAGCTCCTCTCCGGTCTTCCCGTACGCTACCGCGCCCGGGCAGTTCCTCGCCCGCGCACGCGCCGCTCCGCCAACTGATTGGCCTCCGGCGCCTCGGATTGGCCCAGGCCGTCCAACAGCAGCC
>chr22:51021039-51021316(-)
GAAGCCCAGGCCGGCCGGAAGAGGAGCCGAGCGCGGCCGGAAGGAACCGAGCCCGTCCGAAGGGAGCGGAGCGCAGCCTGGCCTGGGGCCCGGTCGAGCCCGCGCCATGGCGGCCGAGGCGACAGCTGTGGCCGGAAGCGGGGCTGTTGGCGGCTGCCTGGCCAAAGACGGCTTGCAGCAGTCTAAGTGCCCGGACACTACCCCAAAACGGCGGCGCGCCTCGTCGCTGTCGCGTGACGCCGAGCGCCGAGCCTACCAATGGTGCCGGGAGTACTTG
>chr22:51021477-51021754(-)
ACCCACGCTCAGGACGGACGCGCGCTGGACGGCTCTTCCTTGTCGGAGCGCCCCAGGGGTCGGGGAAGAGGGCCCGGCAAGGGAGCCCTCGCGCCGGAGCTGCAGCTGCAGCCGCCGCCCCGCCGCCCCGCCGGCTCCCACGGGGCAGAGACGCAGCTCCTCTCCGGTCTTCCCGTACGCTACCGCGCCCGGGCAGTTCCTCGCCCGCGCACGCGCCGCTCCGCCAACTGATTGGCCTCCGGCGCCTCGGATTGGCCCAGGCCGTCCAACAGCAGCC
>chr22:51021039-51021316(+)
CAAGTACTCCCGGCACCATTGGTAGGCTCGGCGCTCGGCGTCACGCGACAGCGACGAGGCGCGCCGCCGTTTTGGGGTAGTGTCCGGGCACTTAGACTGCTGCAAGCCGTCTTTGGCCAGGCAGCCGCCAACAGCCCCGCTTCCGGCCACAGCTGTCGCCTCGGCCGCCATGGCGCGGGCTCGACCGGGCCCCAGGCCAGGCTGCGCTCCGCTCCCTTCGGACGGGCTCGGTTCCTTCCGGCCGCGCTCGGCTCCTCTTCCGGCCGGCCTGGGCTTC'''

# Remove whitespace at the beginning and end of the exercise 5 result
exercise5 = exercise5.strip()
assert exercise5 == answer5
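
If the assertion fails, a bare `AssertionError` does not say where the two strings differ. The sketch below is one possible debugging aid, not part of the grading setup: it uses Python's standard `difflib` module to list the differing lines. The `expected` and `actual` strings are short hypothetical stand-ins for `answer5` and `exercise5`, and `show_differences` is a name chosen for this example.

```python
import difflib


def show_differences(actual, expected):
    """Return unified-diff lines between two multi-line strings."""
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="expected", tofile="actual", lineterm=""))


# Hypothetical strings standing in for `answer5` and `exercise5`
expected = "2442 example.fasta\n>chr22:1-10(+)\nACGTACGTAC"
actual = "2442 example.fasta\n>chr22:1-10(-)\nGTACGTACGT"

# Lines starting with "-" appear only in `expected`, "+" only in `actual`
for line in show_differences(actual, expected):
    print(line)
```

In the notebook, calling `show_differences(exercise5, answer5)` would give the same kind of listing for the real output; an empty list means the strings match.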

#### Exercise 6: Computational "validation" of binding sites
Since we're wearing our computational biologist hat and don't do wet-lab experiments, we can't validate this finding by doing another ChIP-seq experiment. Instead, we'll validate by looking at another dataset.

Look at the contents of `gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta`. How do the sequences compare to the known binding sites reported in [this paper](http://www.genomebiology.com/2011/12/7/R70)?
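
One way to make the comparison concrete is to scan the FASTA records for a motif. The Python sketch below is a starting point under stated assumptions: the motif `GGGRNWYYCC` is a κB consensus often quoted in the literature and is an assumption here, so replace it with the motif(s) the paper actually reports. `read_fasta` and `count_motif` are helpers written for this illustration.

```python
import os
import re

# IUPAC nucleotide codes expanded to regular-expression character classes
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "W": "[AT]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_motif(sequence, motif):
    """Count non-overlapping motif matches on the sequence and on its reverse complement."""
    pattern = re.compile("".join(IUPAC[base] for base in motif))
    sequence = sequence.upper()
    reverse = sequence.translate(COMPLEMENT)[::-1]
    return len(pattern.findall(sequence)) + len(pattern.findall(reverse))


# Assumed consensus; substitute the motif(s) reported in the paper
KAPPA_B = "GGGRNWYYCC"

fasta = "gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta"
if os.path.exists(fasta):
    with_motif = sum(count_motif(seq, KAPPA_B) > 0 for _, seq in read_fasta(fasta))
    print(with_motif, "sequences contain at least one match")
```

Both strands are scanned because a binding site can sit on either strand of the promoter. A motif that equals its own reverse complement would be counted twice by this simple approach.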

Your text here