Practical Data Science Tutorial Project

# pybedtools Tutorial

## Introduction

Advances in technology have greatly affected biology, allowing us to use computational models and techniques to further advance the field. One area shaped by this change is genomics, the study of how the DNA and genes of an organism interact collectively, on a broad level. This is not to be confused with genetics, which looks at the effects of individual genes.

`bedtools` is a powerful command-line toolset for genome arithmetic and calculations. On their website, they refer to it as a *"swiss-army knife of tools for a wide-range of genomics analysis tasks"*. The package is used by people from many disciplines for quick and efficient analysis.

`pybedtools` is essentially a Python wrapper for `bedtools`. You might be wondering now, *"`bedtools` is powerful, efficient, and easy to use from the command line and `bash`. Why do we even need a Python version?"* Some might agree with this statement. However, `pybedtools` allows for seamless integration of the `bedtools` toolset into Python, a language that many people know and understand well. Users no longer need to know the intricacies of other scripting and programming languages, such as `Perl`, `bash`, and `awk`, which are needed for more complicated uses of the package. Even better, `pybedtools` calls the `bedtools` programs under the hood, so run time is largely preserved. AWESOME!!!

This tutorial gives an overview of the basic, but important, methods of genome analysis, with examples, using the various functions of `pybedtools`.

## Installing the Libraries

In order to use `pybedtools`, `bedtools` must also be installed locally on the computer. There are many methods to install `bedtools`. Installing directly from GitHub is a popular and recommended option: https://github.com/arq5x/bedtools2. 

However, instructions for other methods can also be found here: http://bedtools.readthedocs.io/en/latest/content/installation.html.

To install `pybedtools`, `pip` can be used, for example by running `pip install pybedtools` from the command line.

To make sure the packages are installed properly, check that the following command runs without errors:

In [1]:
import pybedtools 

## Loading the Sample Data

To describe the basic `bedtools` functions, we will be using sample data provided by the `pybedtools` library. The file format that we will use initially is called `bed`. This file type is used to store instances of genome sequences from chromosomes of humans, mice, rats, etc. 

At the very least, these files store intervals of genome sequences that have three fields: the chromosome number (*chr#*), the start position of the feature, and the end position of the feature. However, more features can be stored, including the name of the feature, the score, and the direction of the strand. For more information, please visit: https://useast.ensembl.org/info/website/upload/bed.html
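
As a rough illustration of this layout, here is a minimal pure-Python sketch that splits one `bed` line into named fields. The `BedFeature` tuple and the `parse_bed_line` helper are hypothetical names made up for this tutorial; they are not part of `pybedtools`, which does this parsing for us.

```python
from typing import NamedTuple, Optional

class BedFeature(NamedTuple):
    chrom: str
    start: int                    # start position (0-based in the BED format)
    end: int                      # end position (not included in the interval)
    name: Optional[str] = None
    score: Optional[str] = None
    strand: Optional[str] = None

def parse_bed_line(line: str) -> BedFeature:
    """Split a tab-separated BED line; only the first three fields are required."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start, and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    optional = fields[3:6]        # name, score, strand (any of them may be absent)
    return BedFeature(chrom, start, end, *optional)

feature = parse_bed_line("chr1\t1\t100\tfeature1\t0\t+")
print(feature.chrom, feature.start, feature.end, feature.strand)
```

Lines with only the three required fields parse as well; the optional fields are then left as `None`.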

The following shows how to load the example files, and how the `bed` sample data is formatted. We use the function `example_bedtool` because the file is an example stored within the library; to access files that are not in the library, pass the file path to the `BedTool` constructor instead.

In [2]:
a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
print(type(a))
print(a)
print(b)
print(b[0])

<class 'pybedtools.bedtool.BedTool'>
chr1	1	100	feature1	0	+
chr1	100	200	feature2	0	+
chr1	150	500	feature3	0	-
chr1	900	950	feature4	0	+

chr1	155	200	feature5	0	-
chr1	800	901	feature6	0	+

chr1	155	200	feature5	0	-



We see here that there are three fields in addition to the three required fields described above. Even though `a` and `b` are `BedTool` objects, we can index them to extract individual lines (features), which is useful when parsing through files.

Many of the functions in `bedtools` and `pybedtools` can also work with other file types, including `vcf` and `fasta`. However, for the purposes of this tutorial, `bed` files will be used for the function examples.

## Essential `bedtools` Functions and Examples

In this section, we will discuss many important `bedtools` functions, their implementation in `pybedtools`, and some examples. 

Note: For this section, file `A` refers to `a.bed` and file `B` refers to `b.bed`. 

### Intersect

`Intersect` is the quintessential `bedtools` function and one of its most powerful. Given two `bed` files, it outputs all the regions where the features overlap.

In [3]:
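# u = True: report each feature of A once if it overlaps anything in B
a_with_b = a.intersect(b, u = True)
# Default: report only the overlapping portions of A
a_and_b = a.intersect(b)
print(a_with_b)
print(a_and_b)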

chr1	100	200	feature2	0	+
chr1	150	500	feature3	0	-
chr1	900	950	feature4	0	+

chr1	155	200	feature2	0	+
chr1	155	200	feature3	0	-
chr1	900	901	feature4	0	+



We see that the value of the `u` parameter is important. When `u` is `True`, each feature of `A` is reported once, in full, if it overlaps any feature of `B`. Thus, the output contains only original features of `A`, and only those that overlap `B` to some extent. For example, the first reported feature, `(100,200)`, is not the true intersection between `A` and `B`, which would be `(155,200)`. But since part of this range in `A` overlaps `B`, the whole feature of `A` is reported. However, `(1,100)` is not reported, because none of its positions fall within any range in `B`. On the other hand, when `u` is not set (the default, `False`), the result is the true intersection between `A` and `B`: only the overlapping portions are reported.
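
To make the difference concrete, the two behaviors can be sketched in plain Python. This is only a conceptual sketch on toy copies of the sample features, not how `bedtools` is implemented; the `overlaps` and `intersect` helpers are hypothetical names.

```python
# Toy copies of the A and B features: (chrom, start, end, name)
A = [("chr1", 1, 100, "feature1"), ("chr1", 100, 200, "feature2"),
     ("chr1", 150, 500, "feature3"), ("chr1", 900, 950, "feature4")]
B = [("chr1", 155, 200, "feature5"), ("chr1", 800, 901, "feature6")]

def overlaps(x, y):
    """Two intervals overlap if each one starts before the other one ends."""
    return x[0] == y[0] and x[1] < y[2] and y[1] < x[2]

def intersect(a_feats, b_feats, u=False):
    out = []
    for a in a_feats:
        hits = [b for b in b_feats if overlaps(a, b)]
        if u:
            # Report the original A feature once if it has any overlap
            if hits:
                out.append(a)
        else:
            # Report only the overlapping portion, once per overlapping B feature
            for b in hits:
                out.append((a[0], max(a[1], b[1]), min(a[2], b[2]), a[3]))
    return out

print(intersect(A, B, u=True))
print(intersect(A, B))
```

On this toy data, the sketch gives the same intervals as the `pybedtools` output above.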

### Merge

`Merge` combines any overlapping regions within a file into a single region. We introduce the argument `s`, which is common to most functions. `s` determines whether to take the strand direction into account; when it is `True`, the function is evaluated separately for each strand. This is why `(150,500)`, the only feature on the `-` strand, appears on its own when `s = True`.
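
Conceptually, merging can be sketched in plain Python as a single pass over sorted intervals. This is an illustrative sketch under the assumption that touching intervals (one ends where the next starts) are merged too, which is what the sample output in this section shows; the `merge` helper and its `stranded` flag are hypothetical names, not the `pybedtools` API.

```python
def merge(features, stranded=False):
    """Merge overlapping or touching (chrom, start, end, strand) intervals."""
    merged = {}  # group key -> list of [start, end] runs
    for chrom, start, end, strand in sorted(features):
        # With stranded=True, + and - features are merged separately
        key = (chrom, strand) if stranded else (chrom,)
        runs = merged.setdefault(key, [])
        if runs and start <= runs[-1][1]:
            runs[-1][1] = max(runs[-1][1], end)   # extend the current run
        else:
            runs.append([start, end])             # start a new run
    out = [(key[0], s, e) for key, runs in merged.items() for s, e in runs]
    return sorted(out)

A = [("chr1", 1, 100, "+"), ("chr1", 100, 200, "+"),
     ("chr1", 150, 500, "-"), ("chr1", 900, 950, "+")]
print(merge(A))
print(merge(A, stranded=True))
```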

In [4]:
a_merge = a.merge()
a_merge_strand = a.merge(s = True)
print(a_merge)
print(a_merge_strand)

chr1	1	500
chr1	900	950

chr1	1	200
chr1	150	500
chr1	900	950



### Subtract

Given `A` and `B`, `subtract` removes from the features of `A` any portions that overlap features of `B`. An additional example is provided to demonstrate the `s` parameter.
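
The idea can be sketched in plain Python by cutting each feature of `A` into the pieces that survive after removing every overlapping feature of `B`. This is a conceptual sketch on toy copies of the sample data; the `subtract` helper and its `stranded` flag are hypothetical names.

```python
def subtract(a_feats, b_feats, stranded=False):
    """Remove the parts of each A feature that overlap any B feature."""
    out = []
    for chrom, start, end, name, strand in a_feats:
        pieces = [(start, end)]
        for b_chrom, b_start, b_end, _, b_strand in b_feats:
            if b_chrom != chrom or (stranded and b_strand != strand):
                continue
            next_pieces = []
            for s, e in pieces:
                if b_end <= s or e <= b_start:    # no overlap, keep the piece
                    next_pieces.append((s, e))
                    continue
                if s < b_start:                   # part to the left of B survives
                    next_pieces.append((s, b_start))
                if b_end < e:                     # part to the right of B survives
                    next_pieces.append((b_end, e))
            pieces = next_pieces
        out.extend((chrom, s, e, name, strand) for s, e in pieces)
    return out

A = [("chr1", 1, 100, "feature1", "+"), ("chr1", 100, 200, "feature2", "+"),
     ("chr1", 150, 500, "feature3", "-"), ("chr1", 900, 950, "feature4", "+")]
B = [("chr1", 155, 200, "feature5", "-"), ("chr1", 800, 901, "feature6", "+")]
for row in subtract(A, B):
    print(row)
```

With `stranded=True`, `feature2` is left untouched because the only feature of `B` that overlaps it lies on the opposite strand.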

In [5]:
a_subtract_b = a.subtract(b)
a_subtract_b_strand = a.subtract(b, s = True)
print(a_subtract_b)
print(a_subtract_b_strand)

chr1	1	100	feature1	0	+
chr1	100	155	feature2	0	+
chr1	150	155	feature3	0	-
chr1	200	500	feature3	0	-
chr1	901	950	feature4	0	+

chr1	1	100	feature1	0	+
chr1	100	200	feature2	0	+
chr1	150	155	feature3	0	-
chr1	200	500	feature3	0	-
chr1	901	950	feature4	0	+



### Complement

`Complement` returns the intervals of the genome that are not covered by any feature in the input file.

The `genome` parameter is necessary for this function. The genome field specifies the species and version of the reference genome. For example, in `hg38`, `hg` refers to the human genome and `38` is the build number of the human reference genome. Another example is `mm10`, which refers to build `10` of the mouse reference genome. Below, we see that there are differences in the output for the three genomes, even between the two human genomes. This is because chromosome lengths differ between species and reference genomes are periodically updated. Thus, this field is important when working with functions that require information, such as chromosome lengths, that is not provided in the current file.
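
To see why chromosome lengths are needed, here is a plain-Python sketch of the idea. The `complement` helper is a hypothetical name, and the lengths in `toy_sizes` are made up for illustration; they are not taken from a real assembly.

```python
def complement(features, chrom_sizes):
    """Return the gaps not covered by any (chrom, start, end) feature."""
    out = []
    for chrom in sorted(chrom_sizes):
        cursor = 0  # first position not yet known to be covered
        for _, start, end in sorted(f for f in features if f[0] == chrom):
            if start > cursor:
                out.append((chrom, cursor, start))
            cursor = max(cursor, end)
        # The last gap runs to the end of the chromosome, hence the need for its length
        if cursor < chrom_sizes[chrom]:
            out.append((chrom, cursor, chrom_sizes[chrom]))
    return out

A = [("chr1", 1, 100), ("chr1", 100, 200), ("chr1", 150, 500), ("chr1", 900, 950)]
toy_sizes = {"chr1": 1000, "chr2": 300}  # made-up lengths, not a real genome
print(complement(A, toy_sizes))
```

A chromosome with no features at all, like `chr2` here, is reported as one interval spanning its whole length, which matches the pattern in the real output for `chr10` and the others.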

In [6]:
a_complement_hg38 = a.complement(genome = "hg38")
a_complement_hg19 = a.complement(genome = "hg19")
a_complement_mm = a.complement(genome = "mm10")
a_complement_hg38.head(5)
print("\n")
a_complement_hg19.head(5)
print("\n")
a_complement_mm.head(5)

chr1	0	1
 chr1	500	900
 chr1	950	248956422
 chr10	0	133797422
 chr10_GL383545v1_alt	0	179254
 

chr1	0	1
 chr1	500	900
 chr1	950	249250621
 chr10	0	135534747
 chr11	0	135006516
 

chr1	0	1
 chr1	500	900
 chr1	950	195471971
 chr10	0	130694993
 chr11	0	122082543
 

### Closest

Given `A` and `B`, `closest` returns, for each feature in `A`, the nearest feature in `B`. Below is an example from the sample data.

In [7]:
a_closest_b = a.closest(b, s = True, d = True)
print(a_closest_b)

chr1	1	100	feature1	0	+	chr1	800	901	feature6	0	+	701
chr1	100	200	feature2	0	+	chr1	800	901	feature6	0	+	601
chr1	150	500	feature3	0	-	chr1	155	200	feature5	0	-	0
chr1	900	950	feature4	0	+	chr1	800	901	feature6	0	+	0



The `d` parameter (specific to `closest`) adds another column to the output, which reports the distance from the original feature to the nearest feature. Overlapping features have a distance of `0`. This can be helpful in filtering out sequences that are either too close to or too far from each other.
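
The distances in the output above can be reproduced with a small plain-Python sketch. The `+ 1` in the `distance` helper is an assumption chosen to match the numbers shown (for example, `800 - 100 + 1 = 701`); the helper names are hypothetical, and the sketch assumes all features are on one chromosome.

```python
def distance(a, b):
    """Distance between two (start, end) intervals; 0 if they overlap."""
    a_start, a_end = a
    b_start, b_end = b
    if a_start < b_end and b_start < a_end:
        return 0
    if b_start >= a_end:                 # b lies to the right of a
        return b_start - a_end + 1
    return a_start - b_end + 1           # b lies to the left of a

def closest(a_feats, b_feats, same_strand=False):
    """For each (name, start, end, strand) in A: (A name, nearest B name, distance)."""
    out = []
    for name, start, end, strand in a_feats:
        candidates = [b for b in b_feats if not same_strand or b[3] == strand]
        if not candidates:
            continue  # this sketch simply skips features without a candidate
        best = min(candidates, key=lambda b: distance((start, end), (b[1], b[2])))
        out.append((name, best[0], distance((start, end), (best[1], best[2]))))
    return out

A = [("feature1", 1, 100, "+"), ("feature2", 100, 200, "+"),
     ("feature3", 150, 500, "-"), ("feature4", 900, 950, "+")]
B = [("feature5", 155, 200, "-"), ("feature6", 800, 901, "+")]
print(closest(A, B, same_strand=True))
```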

### Sort

`Sort` is an important tool because it can sort an input file in many ways, including but not limited to:
* No arguments = sort by chromosome, then by starting position
* `-sizeD` = sort by descending length
* `-sizeA` = sort by ascending length
* `-chrThenSizeD` = sort by chromosome, then by descending length
* `-chrThenSizeA` = sort by chromosome, then by ascending length
* `-faidx (names.txt)` = sort chromosomes in the order given by the file `names.txt`

One thing to note is that chromosome sorting is lexicographical, so `chr10` comes before `chr2`. To bypass this, the last argument above can be used. As `A` and `B` are both already sorted, we will use the `a_complement_hg38` result stored previously to demonstrate this function below.
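
The difference between lexicographical order and a user-supplied chromosome order can be sketched in plain Python. The toy features and the `names_order` list (standing in for the contents of a names file) are made up for illustration.

```python
# Toy features: (chrom, start, end)
features = [("chr10", 0, 50), ("chr2", 30, 80), ("chr1", 500, 900), ("chr2", 5, 20)]

# Default idea: chromosome names are compared as text, then by start position
lexicographic = sorted(features)

# Hypothetical contents of a names file: one chromosome per line, in the wanted order
names_order = ["chr1", "chr2", "chr10"]
rank = {chrom: i for i, chrom in enumerate(names_order)}
custom = sorted(features, key=lambda f: (rank[f[0]], f[1]))

print([f[0] for f in lexicographic])
print([f[0] for f in custom])
```

In the first ordering `chr10` lands between `chr1` and `chr2`; in the second, the chromosomes follow the order given in the list.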

In [8]:
a_complement_hg38.sort().head(5)
print("\n")
a_complement_hg38.sort(sizeD = True).head(5)
print("\n")
a_complement_hg38.sort(sizeA = True).head(5)
print("\n")
a_complement_hg38.sort(chrThenSizeD = True).head(5)

chr1	0	1
 chr1	500	900
 chr1	950	248956422
 chr10	0	133797422
 chr10_GL383545v1_alt	0	179254
 

chr1	950	248956422
 chr2	0	242193529
 chr3	0	198295559
 chr4	0	190214555
 chr5	0	181538259
 

chr1	0	1
 chr1	500	900
 chrUn_KI270394v1	0	970
 chrUn_KI270392v1	0	971
 chrUn_KI270423v1	0	981
 

chr1	950	248956422
 chr1	500	900
 chr1	0	1
 chr10	0	133797422
 chr10_GL383545v1_alt	0	179254
 

### Slop

Given an input file `A`, `slop` extends each feature by a given number of base pairs, without going past the ends of the chromosome. Below is an example from the sample data, where we extend each feature by `100` base pairs in both directions.
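
In plain Python, the idea can be sketched as widening each interval and clamping it to the chromosome boundaries, which is why the chromosome lengths are needed. The `slop` helper below is a hypothetical stand-in; the `chr1` length is the `hg19` value that appears in the `complement` output earlier.

```python
def slop(features, b, chrom_sizes):
    """Extend each (chrom, start, end, name) feature by b bases on both sides."""
    out = []
    for chrom, start, end, name in features:
        new_start = max(0, start - b)                    # never go below position 0
        new_end = min(chrom_sizes[chrom], end + b)       # never go past the chromosome end
        out.append((chrom, new_start, new_end, name))
    return out

A = [("chr1", 1, 100, "feature1"), ("chr1", 100, 200, "feature2"),
     ("chr1", 150, 500, "feature3"), ("chr1", 900, 950, "feature4")]
hg19_sizes = {"chr1": 249250621}
print(slop(A, 100, hg19_sizes))
```

The first two features start at `0` rather than at a negative position because of the clamping.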

In [9]:
a_extended_hg = a.slop(b=100, genome='hg19') #The genome parameter is required. 
print(a_extended_hg)

chr1	0	200	feature1	0	+
chr1	0	300	feature2	0	+
chr1	50	600	feature3	0	-
chr1	800	1050	feature4	0	+



### Genomecov

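Another function is `genomecov`, which computes how many features cover each region of the genome (the coverage depth). With `bg = True`, the output lists each covered region followed by its depth.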
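
The depth calculation can be sketched in plain Python by counting how many intervals are open at each position where an interval starts or ends. This is a conceptual sketch for a single chromosome; `coverage_bedgraph` is a hypothetical name.

```python
def coverage_bedgraph(features):
    """Return (start, end, depth) runs with depth > 0 for (start, end) intervals."""
    events = {}  # position -> net change in depth at that position
    for start, end in features:
        events[start] = events.get(start, 0) + 1
        events[end] = events.get(end, 0) - 1
    runs, depth, prev = [], 0, None
    for pos in sorted(events):
        if depth > 0 and pos > prev:
            if runs and runs[-1][1] == prev and runs[-1][2] == depth:
                runs[-1] = (runs[-1][0], pos, depth)   # extend a run of equal depth
            else:
                runs.append((prev, pos, depth))
        depth += events[pos]
        prev = pos
    return runs

A = [(1, 100), (100, 200), (150, 500), (900, 950)]
for run in coverage_bedgraph(A):
    print(run)
```

The region `(150, 200)` has depth `2` because it is covered by both `feature2` and `feature3`.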

In [10]:
a_gencov = a.genome_coverage(bg = True, genome ='hg19') #The genome parameter is required. 
a_gencov.head()

chr1	1	150	1
 chr1	150	200	2
 chr1	200	500	1
 chr1	900	950	1
 

## The Genome File Types and Conversions

One of the most annoying, yet common, themes with genome files is that there are many different file types for storing information. Though these file types are useful for their specific purposes, it is often necessary to convert between them. This section describes other common and important file types besides `bed`.

### VCF

VCF is a text file type that is commonly used to store variant/mutation information. Each line stores the chromosome, the position, the reference allele, the alternate alleles, and the quality of the call, among other fields. For more information: http://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/.
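
As a rough sketch of this layout, one VCF data line can be split into its eight fixed columns in plain Python. The `parse_vcf_line` helper is a hypothetical name, and the sample line is shortened to the fixed columns only.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Split a tab-separated VCF data line into a dictionary of its fixed columns."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")   # several alternate alleles are comma-separated
    record["SAMPLES"] = fields[8:]             # FORMAT and per-sample columns, left unparsed
    return record

record = parse_vcf_line("chr1\t111\trs6040355\tA\tG,T\t67\tPASS\tNS=2;DP=10")
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```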

In [11]:
v = pybedtools.example_bedtool('v.vcf')
print("#CHROM  POS ID  REF ALT QUAL    FILTER  INFO") # Just useful to look at the structure of the VCF file
print(v)

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr1	14	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
chr1	17	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
chr1	111	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
chr1	223	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
chr1	294	microsat1	GTCT	G,GTACT	50	PASS	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3



### FASTA

The `fasta` file type stores the actual sequence of an interval as nucleotide bases (`A`, `T`, `G`, `C`). Below is an example of how we extract the sequences for the intervals of a `bed` file using the function `sequence`, which is the wrapper for the `bedtools` function `getfasta`. We use the argument `fi` to specify the input `fasta` file.

In [12]:
fasta = pybedtools.example_filename('test.fa')
a_fasta = a.sequence(fi=fasta)
print(open(a_fasta.seqfn).read())

>chr1:1-100
GATGAGTCTGCATGATTTTTATCTTAAATAAATCTGACCCAGGCTGCATCCATCCTGTTCTGACAGTAGAAAGGCATACACACTTTTACTTAAAGCTTT
>chr1:100-200
CATATCTAAAATTGATGGGACATCTGAAAATAAGGGAATTGGGTATATGAAAAGATCCTTTAAAAAACTGCTCTCATCTTTCACATTATCTTCTTTTTAA
>chr1:150-500
AAAGATCCTTTAAAAAACTGCTCTCATCTTTCACATTATCTTCTTTTTAACTAAGTGCAATACAGTGAACAGCAAAATAGCAGCATTTTTAGAGCAACCATTGGGGAAAATTCTTAGATCAGGGGACCTGCTGTCCTGACAGAGTCTATATTCGGCCAGCAAAGGCTCCCTAAATGTGTATCTGAGACTTATTTGAAAACAAGTCACCATAGTACTGAGTACTGAATTGAGTGGCAAACACTCCACTTGCAGTCAATGAAAATCCAATTACACAATATACACAAAGAGATGACCAAGAAAAACATGATTTAAAATTATACTGACAAAAACTAAAATCATATGAAGTAAAA
>chr1:900-950
TATGCCCTTTGCAGAAAAGAGTGAATTGGTGTCAAATTTGGGTTGATTAC



Each new sequence starts with `>`, followed by a header whose parts are separated by `:`. Here, we see that the header stores the chromosome and the interval of the sequence. The `strand` field is important (shown below) because the `+` and `-` strands are complementary (`A` pairs with `T` and `G` pairs with `C`) and are read in opposite directions, so the sequence of a feature on the `-` strand is the reverse complement of the sequence on the `+` strand. Below, forcing strandedness with `s = True` gives the reverse complement for the interval between 150 and 500.
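
The reverse complement itself is easy to sketch in plain Python: swap each base for its partner, then read the result backwards. The `reverse_complement` helper is a hypothetical name; the test string is the first eight bases of the `chr1:150-500` sequence above.

```python
# Map each base to its pairing partner (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("AAAGATCC"))
```

This prints `GGATCTTT`, which is how the strand-aware sequence for that interval ends in the output below, since the start of the `+` strand sequence becomes the end of the `-` strand sequence.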

In [13]:
a_fasta_strand = a.sequence(fi=fasta, s = True)
print(open(a_fasta_strand.seqfn).read())

>chr1:1-100(+)
GATGAGTCTGCATGATTTTTATCTTAAATAAATCTGACCCAGGCTGCATCCATCCTGTTCTGACAGTAGAAAGGCATACACACTTTTACTTAAAGCTTT
>chr1:100-200(+)
CATATCTAAAATTGATGGGACATCTGAAAATAAGGGAATTGGGTATATGAAAAGATCCTTTAAAAAACTGCTCTCATCTTTCACATTATCTTCTTTTTAA
>chr1:150-500(-)
TTTTACTTCATATGATTTTAGTTTTTGTCAGTATAATTTTAAATCATGTTTTTCTTGGTCATCTCTTTGTGTATATTGTGTAATTGGATTTTCATTGACTGCAAGTGGAGTGTTTGCCACTCAATTCAGTACTCAGTACTATGGTGACTTGTTTTCAAATAAGTCTCAGATACACATTTAGGGAGCCTTTGCTGGCCGAATATAGACTCTGTCAGGACAGCAGGTCCCCTGATCTAAGAATTTTCCCCAATGGTTGCTCTAAAAATGCTGCTATTTTGCTGTTCACTGTATTGCACTTAGTTAAAAAGAAGATAATGTGAAAGATGAGAGCAGTTTTTTAAAGGATCTTT
>chr1:900-950(+)
TATGCCCTTTGCAGAAAAGAGTGAATTGGTGTCAAATTTGGGTTGATTAC



## Example Application: Promoters affected by SNPs

Now, we are ready to do some heavy genome arithmetic and use the full force of `pybedtools`! In this final section, we will use what we have learned, and then some, to find some cool results.

Our goal is to figure out the concentration of SNPs (single nucleotide polymorphisms/mutations) that are near DNA promoter binding sites. Even a single nucleotide mutation can change the affinity of a protein for its binding site, modifying the entire pathway. Let us see what happens!

### Loading the Data

For this example, download the files from this folder to the same directory as this notebook: https://drive.google.com/drive/folders/1zs2YkDd--BR2PpeDAX4WkPl3Gu-1vi-o?usp=sharing

In [14]:
from pybedtools import BedTool
chromHmm = BedTool('hesc.chromHmm.bed')
promoter = BedTool('RefGene_hg38_wg_reg.bed')
exons = BedTool('hg38_exons.bed')
vcf = BedTool('test1_truth.vcf')

The *`chromHmm`* file splits the genome into regions and describes what each region does. The *`promoter`* and *`exons`* files give the locations of all promoters and exons in the genome. The *`vcf`* file lists the mutations at each location. Below is some example output to show the structure of these files.

Note: The *`vcf`* file contains only sample mutations. The actual files are too large for our purposes.

In [15]:
chromHmm.head(3)

chr1	10000	10600	15_Repetitive/CNV
 chr1	10600	11137	13_Heterochrom/lo
 chr1	11137	11537	8_Insulator
 

In [16]:
promoter.head(3)

chr1	201283451	201332993	NM_000299	0	+
 chr1	67092175	67134971	NM_001276352	0	-
 chr1	67092175	67134971	NM_001276351	0	-
 

In [17]:
exons.head(3)

chr1	67092175	67093604	NR_075077_exon_0_0_chr1_67092176_r	0	-
 chr1	67096251	67096321	NR_075077_exon_1_0_chr1_67096252_r	0	-
 chr1	67103237	67103382	NR_075077_exon_2_0_chr1_67103238_r	0	-
 

In [18]:
vcf.head(3)

1	1168006	.	C	G	100	PASS	.	GT	0|1
 1	1168010	.	GA	G	100	PASS	.	GT	0|1
 1	1273408	.	C	A	100	PASS	.	GT	0|1
 

### Subtract the Exons from the Promoter Sequences

At an extremely high level, DNA is transcribed into RNA, which is then translated to create protein. For this process to start, a protein needs to bind to the DNA. This binding site on the DNA is called a ***promoter***. During the processing stage before translation, non-coding regions (introns) are removed and coding regions (***exons***) are preserved. Because the promoter binding site is not part of the sequence that codes for the protein, we subtract the exons from the promoter regions.

Because promoter regions are difficult to define without additional information, we will define them using a window of `500` base pairs upstream of the transcriptional start site, created with `slop` (*r = 500, l = 0*). Note that with `s = True`, `slop` interprets `l` and `r` relative to each feature's strand (`l` is the upstream side and `r` is the downstream side). This gives us more accurate results when we intersect with the ***chromHmm*** file. Then, we keep only the intervals that are labeled as promoters.

In [19]:
exons_sorted = exons.sort()
exons_sorted_merge = exons_sorted.merge()
exons_sorted_merge.head(3)

chr1	11873	12227
 chr1	12612	12721
 chr1	13220	14829
 

In [20]:
promoter = promoter.sort()
promoter_slop = promoter.slop(r = 500, l = 0, s = True, genome='hg38')
chromHmm_promoter = chromHmm.intersect(promoter_slop).sort()

In [21]:
chromHmm_promoter_noexons = chromHmm_promoter.subtract(exons, s = True)
chromHmm_promoter_noexons.head(3)

chr1	11873	11937	11_Weak_Txn
 chr1	11937	12137	14_Repetitive/CNV
 chr1	12137	14137	11_Weak_Txn
 

In [22]:
# Take only the intervals that are promoters
def separate_promoters(bed_lines):
    lines = []
    seen = set()  # lines already kept, for fast duplicate checks
    for bed in bed_lines:
        fields = bed.fields
        # The fourth field holds the chromHmm label, e.g. "1_Active_Promoter"
        if "Promoter" in fields[3]:
            new_line = "\t".join(fields)
            if new_line not in seen:
                seen.add(new_line)
                lines.append(new_line)
    return lines

promoter_lines = separate_promoters(chromHmm_promoter_noexons)
chromHmm_promoter_noexons_filtered = pybedtools.BedTool("\n".join(promoter_lines), from_string = True)
chromHmm_promoter_noexons_filtered.head(5)

chr1	27737	28537	2_Weak_Promoter
 chr1	28537	29370	1_Active_Promoter
 chr1	30537	30737	3_Poised_Promoter
 chr1	839337	841137	3_Poised_Promoter
 chr1	855737	857137	3_Poised_Promoter
 

### Categorize the Mutations by Type

Below, we separate the variants in the ***vcf*** file into different mutation types.

In [23]:
# Header is required for VCF files
header = """##fileformat=VCFv4.0
##fileDate=Jun 29, 2010
##INFO=<ID=NS,Number=1,Type=Integer>
##INFO=<ID=DP,Number=1,Type=Integer>
##INFO=<ID=AC,Number=1,Type=Integer>
##INFO=<ID=AN,Number=1,Type=Integer>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=VR,Number=1,Type=Integer,Description="Number of variant reads">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=FT,Number=.,Type=String,Description="Filter">
##FILTER=<ID=atlas>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO"""

# Check if the variant type is a repeat (duplication and repeats)
def check_repeat(ref, alt):
    if len(alt) % len(ref) != 0:
        return (False, 0)
    for base in range(len(alt)):
        if ref[base % len(ref)] != alt[base]:
            return (False, 0)
    return (True, len(alt) // len(ref))  # number of copies of ref in alt

# Check if the variant type is a deletion and insertion
def check_delins(ref, alt):
    if len(ref) > len(alt): a, b = ref, alt
    else: a, b = alt, ref
    # For a pure deletion or insertion, the shorter allele is a prefix of the longer one
    for base in range(len(b)):
        if b[base] != a[base]: return True
    return False

# Separate the variants by mutation type
def separate_by_type(vcf_input_lines):
    result = {"SNP":[],"DEL":[],"INS":[],"DELINS":[],"DUP":[],"REP":[]}
    for interval in vcf_input_lines:
        line = list(interval.fields)
        chrm, ref, alts = line[0], line[3], line[4].split(",") # VCF lines can list multiple alternate alleles
        line[0] = "chr" + chrm # Match the chromosome naming of the BED files
        for alt in alts:
            line[4] = alt
            vcf_line = "\t".join(line)
            dup, num = check_repeat(ref, alt) # Check for duplication
            if dup: # Repetition or duplication exists
                if num == 2: # Duplication
                    result["DUP"].append(vcf_line)
                else: # Repetition
                    result["REP"].append(vcf_line)
            elif len(ref) == len(alt) == 1: # SNV
                result["SNP"].append(vcf_line)
            elif check_delins(ref, alt): # Deletion and insertion
                result["DELINS"].append(vcf_line)
            elif len(ref) > len(alt): # Deletion
                result["DEL"].append(vcf_line)
            elif len(ref) < len(alt): # Insertion
                result["INS"].append(vcf_line)
            # Anything else is an unknown mutation type and is skipped
    return result

vcf_result = separate_by_type(vcf)

### Converting to `VCF` and `BED` Format

Now, as we are only concerned with specific mutation types, let us convert the dictionary entries into `VCF`-formatted `BedTool` objects, so we can use the `pybedtools` functions on them. Just to make sure our algorithm is correct, we output the results below.

In [24]:
SNP = header + "\n" + "\n".join(vcf_result["SNP"])
SNP_VCF = pybedtools.BedTool(SNP, from_string = True)
SNP_VCF.head(3)

chr1	1168006	.	C	G	100	PASS	.	GT	0|1
 chr1	1273408	.	C	A	100	PASS	.	GT	0|1
 chr1	2160480	.	T	C	100	PASS	.	GT	0|1
 

In [25]:
DEL = header + "\n" + "\n".join(vcf_result["DEL"])
DEL_VCF = pybedtools.BedTool(DEL, from_string = True)
DEL_VCF.head(3)

chr1	1168010	.	GA	G	100	PASS	.	GT	0|1
 chr1	2160484	.	GTCCGACCGC	G	100	PASS	.	GT	0|1
 chr1	11856363	.	GTGA	G	100	PASS	.	GT	0|1
 

In [26]:
INS = header + "\n" + "\n".join(vcf_result["INS"])
INS_VCF = pybedtools.BedTool(INS, from_string = True)
INS_VCF.head(3)

chr1	2338230	.	C	CT	100	PASS	.	GT	0|1
 chr1	17355184	.	C	CTAGA	100	PASS	.	GT	0|1
 chr1	26126761	.	C	CCCGGCCCCGCCGCGCAGCCTCCCGCGCCA	100	PASS	.	GT	0|1
 

In [27]:
DELINS = header + "\n" + "\n".join(vcf_result["DELINS"])
DELINS_VCF = pybedtools.BedTool(DELINS, from_string = True)
DELINS_VCF.head(3)

chr1	1273412	.	GTAGGCAGG	GC	100	PASS	.	GT	0|1
 chr1	40557791	.	ACACACTGTTGTTACTTG	AAA	100	PASS	.	GT	0|1
 chr1	114445285	.	CCA	CG	100	PASS	.	GT	0|1
 

### Find the SNPs that Affect Each Promoter

Final step! Let us find the promoter intervals that intersect with the SNPs.
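
One detail that `bedtools` handles for us is the coordinate convention: VCF positions are 1-based, while BED intervals are 0-based and exclude the end position. A plain-Python sketch of the check, using one promoter interval and two SNP positions from the outputs in this tutorial, might look like this (the helper names are hypothetical):

```python
def snp_in_interval(vcf_pos, bed_start, bed_end):
    """VCF positions are 1-based; BED intervals are 0-based and exclude the end."""
    zero_based = vcf_pos - 1
    return bed_start <= zero_based < bed_end

def promoters_with_snps(promoters, snps):
    """Keep each (chrom, start, end, label) promoter hit by at least one (chrom, pos) SNP."""
    return [p for p in promoters
            if any(c == p[0] and snp_in_interval(pos, p[1], p[2]) for c, pos in snps)]

promoters = [("chr1", 27737, 28537, "2_Weak_Promoter"),
             ("chr1", 1167937, 1168137, "2_Weak_Promoter")]
snps = [("chr1", 1168006), ("chr1", 1273408)]
print(promoters_with_snps(promoters, snps))
```

Only the second promoter is kept, because the SNP at position `1168006` falls inside it.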

In [28]:
promoter_intersect_snp = chromHmm_promoter_noexons_filtered.intersect(SNP_VCF, u = True)
promoter_intersect_snp.head(10)

chr1	1167937	1168137	2_Weak_Promoter
 chr1	2160140	2163540	2_Weak_Promoter
 chr1	40723813	40724413	1_Active_Promoter
 chr1	155210376	155211176	2_Weak_Promoter
 chr1	156084576	156085176	1_Active_Promoter
 chr1	161274776	161276576	2_Weak_Promoter
 chr1	165737976	165738776	1_Active_Promoter
 chr1	167408176	167409376	3_Poised_Promoter
 chr1	197115177	197116177	1_Active_Promoter
 chr1	218519377	218520377	3_Poised_Promoter
 

In [29]:
promoter_intersect_snp_range = chromHmm_promoter_noexons_filtered.intersect(SNP_VCF, u = True)
promoter_intersect_snp_range.head(5)

chr1	1167937	1168137	2_Weak_Promoter
 chr1	2160140	2163540	2_Weak_Promoter
 chr1	40723813	40724413	1_Active_Promoter
 chr1	155210376	155211176	2_Weak_Promoter
 chr1	156084576	156085176	1_Active_Promoter
 

If we look at the first active promoter in the UCSC Genome Browser (https://genome.ucsc.edu/cgi-bin/hgGateway, entry: *"chr1	40723813	40724413"*), we do see that there is a promoter region for NFYC! 

(Actual Link: https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A40723814%2D40724413&hgsid=662259431_1MwbUH4ZdXEHXFMFUjJ8AmaRLKWH)

## Summary, Further Exploration, and References

This tutorial gives only a small glimpse of the power of `bedtools` and `pybedtools`. Here are some further links and references:

* http://bedtools.readthedocs.io/en/latest/index.html
* https://daler.github.io/pybedtools/
* http://pedagogix-tagc.univ-mrs.fr/courses/jgb53d-bd-prog/practicals/01_bedtools/index.html
* http://quinlanlab.org/tutorials/bedtools/bedtools.html

Data From:
* ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/
* https://genome.ucsc.edu/cgi-bin/hgTables
* http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgFileUi?db=hg19&g=wgEncodeBroadHmm
