# Creating Heatmaps from any Sequencing Experiment

This notebook contains the Python code I use to construct heatmaps comparing samples across different sequencing experiments. The notebook `R Code` contains the code used to construct the heatmaps with the R package pheatmap, using edgeR for normalization.

**Note:** All of this code could also be written in R. However, I prefer to use Python wherever I can, which is why I wrote this in Python.

## Creating a list of Consensus Sequences

With RNA-seq heatmaps, each row represents a gene, so mapping RNA-seq data onto a heatmap is relatively straightforward. With other sequencing experiments such as ATAC-seq, however, there is no such natural unit to use for the rows. I have experimented with several ways of defining the rows, including tiling the genome and using each tile as a row, and building a consensus sequence from the regions where the sample sequences overlap. I have found that using a consensus sequence gives the best results.

While there are no doubt many ways to create this list, the most straightforward method I have found is to use bedtools and a custom Python script to find all regions of overlap and save the read counts as a matrix.
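To make the idea of "regions of overlap" concrete, here is a minimal pure-Python sketch of merging peaks from several samples into consensus regions. It is not the bedtools-based script used in this notebook; it only illustrates the same kind of operation that `bedtools merge` performs on sorted intervals. The function name `merge_intervals` and the peak coordinates are made up for illustration.

```python
from collections import defaultdict

def merge_intervals(peaks):
    """Merge overlapping (chrom, start, end) intervals into consensus regions."""
    # Group intervals by chromosome so that only same-chromosome peaks are merged
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    consensus = []
    for chrom in sorted(by_chrom):
        merged = []
        for start, end in sorted(by_chrom[chrom]):
            if merged and start <= merged[-1][1]:
                # Overlaps (or touches) the previous region: extend it
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        consensus.extend((chrom, s, e) for s, e in merged)
    return consensus

# Hypothetical peaks from two samples (made-up coordinates)
sample_a = [("chr1", 100, 200), ("chr1", 500, 600)]
sample_b = [("chr1", 150, 250), ("chr2", 10, 50)]
consensus = merge_intervals(sample_a + sample_b)
print(consensus)
```

Each tuple in `consensus` would then become one row of the heatmap, with one column of read counts per sample.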

### Let's get some data to play with!

I am going to download [heart](https://www.encodeproject.org/experiments/ENCSR820ACB/) and [brain](https://www.encodeproject.org/experiments/ENCSR273UFV/) BAM files from ENCODE to use as tutorial data. I will then call peaks on each BAM file to get its genomic loci and the read counts at each locus. There are a variety of programs that call peaks, so use whichever works best for you. Here are some options:

* [Genrich](https://github.com/jsh58/Genrich)
* [MACS2](https://github.com/taoliu/MACS)
* [HOMER](http://homer.ucsd.edu/homer/ngs/peaks.html)
* [HMMRATAC](https://github.com/LiuLabUB/HMMRATAC)

Genrich and MACS2 are the current standards, though Genrich has not been published yet. I will be using Genrich in this tutorial, but Galaxy also has a great [tutorial](https://galaxyproject.github.io/training-material/topics/epigenetics/tutorials/atac-seq/tutorial.html#peak-calling) on MACS2 if you want to use that instead.
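Whichever peak caller you choose, the loci usually end up in a narrowPeak file, which is tab-separated with ten columns in the ENCODE format. As a rough sketch of how one such line can be turned into a genomic locus, the helper below parses a single line using only the standard library. The function name `parse_narrowpeak_line` and the example line are hypothetical and not taken from real output.

```python
# Column names from the ENCODE narrowPeak format
NARROWPEAK_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score",
    "strand", "signalValue", "pValue", "qValue", "peak",
]

def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak line into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(NARROWPEAK_COLUMNS, fields))
    # Coordinates are integers; the signal and significance columns are floats
    for key in ("chromStart", "chromEnd"):
        record[key] = int(record[key])
    for key in ("signalValue", "pValue", "qValue"):
        record[key] = float(record[key])
    return record

# Made-up example line, for illustration only
example = "chr1\t100\t250\tpeak_0\t500\t.\t12.5\t4.2\t-1\t75"
rec = parse_narrowpeak_line(example)
print(rec["chrom"], rec["chromStart"], rec["chromEnd"])
```

For whole files, `pd.read_csv(path, sep='\t', header=None, names=NARROWPEAK_COLUMNS)` should give the same columns as a DataFrame.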

In [4]:
import pandas as pd
import subprocess, os

In [3]:
def run_genrich(bam, outpath, path=None):
    # Given a BAM file, run Genrich and write narrowPeak and BED output to `outpath`.
    # `path` is the Genrich executable; if omitted, `Genrich` is assumed to be on the PATH.
    if not bam.endswith('.bam'):
        # Stop here instead of continuing with an undefined output prefix
        raise ValueError("Please use a valid BAM file: " + bam)
    # Strip the directory and the '.bam' extension to get the output file prefix
    header = os.path.basename(bam)[:-len('.bam')]

    outnarrow = os.path.join(outpath, header + '.narrowPeak')
    outbed = os.path.join(outpath, header + '.bed')
    if not path:
        path = 'Genrich'

    # Pass the command as a list: a single command string only works with shell=True
    cmd = [path, '-t', bam, '-o', outnarrow, '-b', outbed, '-r', '-v']
    # check=True raises CalledProcessError if Genrich exits with an error
    subprocess.run(cmd, check=True)

In [7]:
bamdir = '/mnt/labshare/chromatin-datasets/'
filenames = ['liver-p1.sorted.bam', 'liver-p2.sorted.bam','brain-p1.sorted.bam','brain-p2.sorted.bam']

for i in os.listdir(bamdir):
    if i in filenames:
        print(bamdir+i)

/mnt/labshare/chromatin-datasets/brain-p1.sorted.bam
/mnt/labshare/chromatin-datasets/liver-p1.sorted.bam
/mnt/labshare/chromatin-datasets/liver-p2.sorted.bam
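The listing above prints only three of the four names in `filenames`; `brain-p2.sorted.bam` does not appear. A small check like the following sketch can flag missing inputs before peak calling starts. The helper name `find_missing` and the placeholder file names are hypothetical, and the demonstration runs on a temporary directory rather than on `bamdir`.

```python
import os
import tempfile

def find_missing(directory, expected):
    """Return the expected filenames that are not present in `directory`."""
    present = set(os.listdir(directory))
    return [name for name in expected if name not in present]

# Demonstration on a throwaway directory with placeholder (empty) files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.bam", "b.bam"]:
        open(os.path.join(tmp, name), "w").close()
    missing = find_missing(tmp, ["a.bam", "b.bam", "c.bam"])
print(missing)
```

In this notebook the equivalent call would be `find_missing(bamdir, filenames)`.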
