<img src="Demo1Slide1.JPG">

<img src="Demo1Slide2.JPG">

<img src="Demo1Slide3.JPG">

<img src="Demo1Slide4.JPG">

<img src="Demo1Slide5.JPG">

<img src="Demo1Slide6.JPG">

<img src="Demo1Insert2.JPG">

<img src="Demo1Slide7.JPG">

## Here is an excerpt from the demonstration file directory. 

<img src="Demo1Insert3.JPG">

cohort_125.seqset includes all of the sequence data from 17 "platinum" genomes taken from the CEPH/Utah family and 108 publicly available genomes of Qataris from the Sequence Read Archive (SRA). 

Stored as BAM files, this data would require around 13 TB of storage, but as a "Biograph Database" it requires only about 0.3 TB. The ".seqset" file, which holds the "merged graph" for all 125 genomes, requires 114 GB. Each ".readset" file, which holds the individual index for one genome, requires about 2 GB.
**Biograph databases reduce genome storage requirements by a factor of over 30!**  
The precise level of compression depends on the number of genomes stored together in one database and on their heterogeneity. This extremely high level of compression is possible because the database uses a "merged graph structure," conceptually illustrated in Slides 6 and 7. 
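
As a quick check on the figures quoted above, the arithmetic can be written out. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1000 GB) and takes the rounded sizes from the text at face value.

```python
# Figures quoted in the text; decimal units (1 TB = 1000 GB) are an assumption.
N_GENOMES = 125
BAM_TOTAL_GB = 13 * 1000       # "around 13 TB" as BAM files
SEQSET_GB = 114                # merged graph shared by all 125 genomes
READSET_GB_PER_GENOME = 2      # individual index, one per genome

biograph_total_gb = SEQSET_GB + READSET_GB_PER_GENOME * N_GENOMES
compression_ratio = BAM_TOTAL_GB / float(biograph_total_gb)

print("BioGraph total: {0} GB".format(biograph_total_gb))         # 364 GB
print("Compression ratio: {0:.1f}x".format(compression_ratio))    # 35.7x
```

The total of 364 GB is in line with the rough "about 0.3 TB" figure, and the ratio is consistent with the "factor of over 30" claim.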

The merged graph contains every sequence that occurs in at least one of the merged genomes. When a genome is added to a merged graph, all that needs to be recorded are the new, unique sequences it contains. Each individual genome can then be reconstructed from an "individual index" that enumerates the exact series of sequences from the merged graph that makes up the individual. 
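
The idea in the paragraph above can be illustrated with a toy sketch. It is not BioGraph's actual data structure: the fixed-size fragments, the `add_genome`, `reconstruct`, and `split` helpers, and the two short "genomes" are all hypothetical, chosen only to show that a second, similar genome adds very little to the shared store.

```python
def split(genome, size=4):
    """Cut a sequence into fixed-size fragments (a simplification for illustration)."""
    return [genome[i:i + size] for i in range(0, len(genome), size)]

def add_genome(merged, fragments):
    """Store any fragments not seen before and return this genome's individual index."""
    index = []
    for frag in fragments:
        if frag not in merged["ids"]:
            merged["ids"][frag] = len(merged["seqs"])
            merged["seqs"].append(frag)
        index.append(merged["ids"][frag])
    return index

def reconstruct(merged, index):
    """Rebuild a genome from the shared store and its individual index."""
    return "".join(merged["seqs"][i] for i in index)

merged = {"seqs": [], "ids": {}}
genome_a = "ACGTTTGACCGATTAG"
genome_b = "ACGTTTGAGGGATTAG"   # differs from genome_a in a single fragment

index_a = add_genome(merged, split(genome_a))
stored_after_a = len(merged["seqs"])
index_b = add_genome(merged, split(genome_b))
new_for_b = len(merged["seqs"]) - stored_after_a

print("Fragments stored for first genome:", stored_after_a)   # 4
print("New fragments for second genome:", new_for_b)          # 1
print(reconstruct(merged, index_b) == genome_b)               # True
```

Only the one fragment that differs is added for the second genome; everything else is shared, and each genome is recovered exactly from its index.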

The very high level of compression is possible because there is enormous redundancy in a group of human genomes. The sequence information in an isolated human genome is almost as unpredictable as a random sequence of A's, T's, C's, and G's and therefore does not lend itself to high lossless compression. However, the sequence information in multiple human genomes is 99.9% redundant, permitting much higher compression. The larger the set of genomes combined in a graph, the higher the compression, since only a limited number (estimated between 20 and 40 million) of de novo or inherited variations exist in the entire human population.

Very high-speed queries are possible because the search structure is compact enough to be loaded entirely into RAM.
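
To give a feel for why an in-memory, ordered structure answers sequence queries quickly, here is a minimal sketch using a sorted Python list and binary search. It is an illustration only and not how `libspiral` is implemented; the `find_range` helper and the sample entries are hypothetical. The returned `(begin, end)` pair plays a role analogous to the `begin` and `end` attributes used in the query demonstration.

```python
import bisect

def find_range(sorted_entries, prefix):
    """Return (begin, end) so that sorted_entries[begin:end] all start with prefix."""
    begin = bisect.bisect_left(sorted_entries, prefix)
    # "~" sorts after every DNA letter, so prefix + "~" is just past the last match
    end = bisect.bisect_left(sorted_entries, prefix + "~")
    return begin, end

entries = sorted(["ACGT", "ACGA", "ACTT", "GATT", "TTAG", "ACG"])

begin, end = find_range(entries, "ACG")
print("Found {0} entries".format(end - begin))   # Found 3 entries
print(entries[begin:end])                        # ['ACG', 'ACGA', 'ACGT']

missing = find_range(entries, "CC")
print(missing[1] - missing[0])                   # 0
```

Each lookup costs two binary searches regardless of how many entries match, which is the property that makes a compact, RAM-resident index attractive.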

## Demonstration of BioGraph query speed

In [8]:
from libspiral import seqset
import time

max_results = 20
sample = raw_input('Which graph? ')

# Load the graph once; every query below reuses it
my_graph = seqset('/mnt/{0}.gbwt'.format(sample))

while True:
    sequence = raw_input('What sequence? ')
    if sequence == '':
        break
    # Start the clock before the lookup so the reported time includes the query itself
    start = time.time()
    new_ctx = my_graph.find(sequence)
    if new_ctx.valid:
        count = new_ctx.end - new_ctx.begin
        print 'Found {0} entries\n'.format(count)
        if count > max_results:
            print 'Showing the first {0} matches'.format(max_results)
            count = max_results
        for i in range(new_ctx.begin, new_ctx.begin + count):
            print my_graph.entry(i).sequence
        print 'Query time: {0:.5f} seconds'.format(time.time() - start)
        print
    else:
        print 'No entries found.'


## The examples that follow make much use of the CEPH 1463 pedigree, a Utah family of Northern European ancestry from the Coriell Institute for Medical Research, http://bit.ly/1Kc2q9m

<img src="Demo1Insert1.JPG">

## Demonstration 1 of Real-Time Assembly and Visualization of a Structural Variant -- an insertion in NA12878 and how it was inherited (validated by Evan Eichler at the University of Washington)

In [None]:
# Import the Biograph library and API
from biograph import new_graph, reference, find_variants, visualize

# Import the graph genome of Mother
bg = new_graph("/mnt/NA12878_S1.gbwt")

# Import a human reference
grch37 = reference("/reference/human_g1k_v37/")

# Now let's examine an insertion in NA12878 (Mother) validated by Evan Eichler of U of W
grch37_coords = find_variants(bg, grch37, "8", 88268339, 88269142)

# This is a homozygous 2.457 Kbase insertion in place of a 63 bp deletion in grch37_coords:
for v in grch37_coords:
    visualize(v.variants, v.coverage, v.ref_range)

In [None]:
# Here are examples of Python operations we can perform on it

# List all of the variants in this range (there is only 1)
grch37_coords[0].variants

In [None]:
# Check whether the insertion is flagged as a structural variant
sv = grch37_coords[0].variants[0]  # which variant index should we use?
print sv.is_structural, sv.left_forward, sv.right_forward

In [None]:
# Show the reverse complement of the sequence
print str(sv.flip().sequence)

In [None]:
# Now let's look at the Maternal Grandfather, MGFather
# Import the graph genome
bg = new_graph("/mnt/NA12891_S1.gbwt")

# Re-run the variant search against the MGFather's graph
grch37_coords = find_variants(bg, grch37, "8", 88268339, 88269142)

# Visualizing MGFather: the insertion has identical location and sequence, and it is homozygous
for v in grch37_coords:
    visualize(v.variants, v.coverage, v.ref_range)

In [None]:
# Import the graph genome of the MGMother
bg = new_graph("/mnt/NA12892_S1.gbwt")

# Insertion in NA12892 (MGMother) validated by Eichler
grch37_coords = find_variants(bg, grch37, "8", 88268339, 88269142)

# Visualizing MGMother: the insertion has identical location and sequence, and is heterozygous
for v in grch37_coords:
    visualize(v.variants, v.coverage, v.ref_range)

## Demonstration 2 of Real-Time Assembly and Visualization of a Structural Variant -- a deletion in NA12878 and how it was inherited (validated by Evan Eichler at the University of Washington)

In [None]:
# Import the graph genome of Mother
bg = new_graph("/mnt/NA12878_S1.gbwt")

# Deletion in NA12878 (Mother) validated by Eichler
grch37_coords = find_variants(bg, grch37, "5", 12810916, 12820623)

# Let's see it -- there is about a 9.5 Kbase deletion in one strand, and two SNPs in the other
for v in grch37_coords:
    visualize(v.variants, v.coverage, v.ref_range)

In [None]:
# Here are examples of Python operations we can perform on it

# List all of the variants in this range
grch37_coords[0].variants

In [None]:
# Check whether the deletion is flagged as a structural variant
sv = grch37_coords[0].variants[1]
print sv.is_structural, sv.left_forward, sv.right_forward

In [None]:
# Show the reverse complement of the sequence (there is no sequence, because this is a deletion)
print str(sv.flip().sequence)

In [None]:
# Import the graph genome of the MGFather
bg = new_graph("/mnt/NA12891_S1.gbwt")

# Deletion in NA12891 (MGFather) validated by Eichler
grch37_coords = find_variants(bg, grch37, "5", 12810916, 12820623)

# Let's see it -- in the MGFather the deletion has identical location and sequence, but it is homozygous
for v in grch37_coords:
    visualize(v.variants, v.coverage, v.ref_range)

In [None]:
# Import the graph genome of the MGMother
bg = new_graph("/mnt/NA12892_S1.gbwt")

# SNPs in NA12892 (MGMother) validated by Eichler
grch37_coords = find_variants(bg, grch37, "5", 12810916, 12820623)

# Let's see it -- the MGMother is homozygous for the SNPs
for v in grch37_coords:
    visualize(v.variants, v.coverage, v.ref_range)

<img src="Demo1Slide8.JPG">

## Demonstration of Querying for a sequence in an individual -- is the insertion or deletion found in NA12878 found in one of our Qatari samples, and where does it occur? This is like the search for a SNP below.

In [None]:
# ROB -- HELP!
"""
    Search a population for a set of SNPs.
"""
import libspiral
from biograph import *
import glob

# Load the cohort biograph, two references, and all the sample ID's
bg = new_graph("/mnt/cohort_125.gbwt")

ref37 = reference("/reference/human_g1k_v37")
ref38 = reference("/reference/homo_sapiens_GCA_000001405.19_GRCh38.p4")

# The biograph files are named for the sample they contain
sample_names = [('.'.join(x.split('/')[2].split('.')[:-1])) for x in glob.glob("/mnt/*.bitmap")]
samples = {}
for sn in sample_names:
    samples[sn] = bg.load_readset("/mnt/" + sn + ".bitmap")
    
# Load SNPs and put into lists
with open("drug_response_snps", "rb") as f:
    data = f.read()
snps = [x.split('\t') for x in data.split('\n')[:-1]]

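# Define a function to compute the coverage of one SNP allele in one sample.
# As used here, snp[0] is the chromosome, snp[2] the position, and snp[4] the alleles (split on '/')
def coverage(bg, ref, readset, snp, base, rlen=100):
    # Take rlen bases of reference on each side of the SNP position
    seq = ref.make_range(snp[0], int(snp[2])-rlen, int(snp[2])+rlen+1, True).sequence
    # Substitute the allele of interest at the center of the window
    seq[rlen] = base
    # Return the read depth at the SNP position itself
    return bg.seq_coverage(seq, readset)[rlen]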

# Single individual, single allele coverage
coverage(bg, ref38, samples['NA12878_S1'], snps[0], 'C')
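
The window-and-substitute step inside `coverage` can be tried without BioGraph. The following is a plain-string sketch: `allele_window`, the short reference string, and the tab-separated record are all hypothetical, and 0-based positions are an assumption. Only columns 0, 2, and 4 of the record mirror what the notebook actually reads.

```python
def allele_window(ref_seq, pos, base, rlen=100):
    """Return rlen bases either side of pos, with the base at pos replaced by `base`."""
    start = pos - rlen
    stop = pos + rlen + 1
    if start < 0 or stop > len(ref_seq):
        raise ValueError("window extends past the end of the reference")
    window = list(ref_seq[start:stop])
    window[rlen] = base          # the SNP sits at the center of the window
    return "".join(window)

ref_seq = "ACGTACGTACGTACGTACGT"

# Hypothetical record: chromosome, id, position, strand, alleles
snp = "chrT\tid1\t10\t+\tG/T".split("\t")

# One query sequence per allele, as in the per-allele coverage calls
windows = {}
for allele in snp[4].split("/"):
    windows[allele] = allele_window(ref_seq, int(snp[2]), allele, rlen=3)

print(windows["G"])   # TACGTAC
print(windows["T"])   # TACTTAC
```

Each allele produces its own query sequence; looking up the coverage of each one at the center position is what distinguishes which alleles a sample carries.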

## Demonstration of determining what reads provide evidence for a variant. Take the genome and variant above and show the coverage rising and falling around the anchors. This is like the example below.

In [None]:
# ROB -- HELP!

"""
    Simple SV lookups and assembly coverage reporting with BioGraph
"""
import libspiral
from biograph import *

# Import a reference
ref37 = reference("/reference/human_g1k_v37")

# Load 125 individuals
bg = new_graph("/mnt/cohort_125.gbwt")

# Choose between individuals in a merged BioGraph using readsets
na12877 = bg.load_readset("/mnt/NA12877_S1.bitmap") 
na12878 = bg.load_readset("/mnt/NA12878_S1.bitmap") 
na12879 = bg.load_readset("/mnt/NA12879_S1.bitmap") 
na12880 = bg.load_readset("/mnt/NA12880_S1.bitmap") 

# Pick a region of interest
chromosome = "2"
start = 100000
end = 200000

# Find variants in the region of interest
# (named region_variants to avoid shadowing Python's built-in vars())
region_variants = find_variants(bg, ref37, chromosome, start, end, readset=na12878)

v = region_variants[0].variants[11]

# Compute coverage for the assembly in each individual
coverage77 = bg.seq_coverage(v.assembly_sequence, na12877)
coverage78 = bg.seq_coverage(v.assembly_sequence, na12878)
coverage79 = bg.seq_coverage(v.assembly_sequence, na12879)
coverage80 = bg.seq_coverage(v.assembly_sequence, na12880)

# Coverage is a list of numbers representing the depth of coverage
# for each base in the sequence
coverage79[10:30]
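
For readers unfamiliar with per-base coverage, this small sketch shows what such a list looks like for a handful of reads. It is a generic depth calculation under simple assumptions (every read has the same length and is described only by its start position); `depth_per_base` is a hypothetical helper, and BioGraph's `seq_coverage` may count reads differently.

```python
def depth_per_base(seq_len, read_starts, read_len):
    """Count how many reads overlap each position of a sequence of length seq_len."""
    depth = [0] * seq_len
    for start in read_starts:
        # Clip each read to the bounds of the sequence
        for i in range(max(start, 0), min(start + read_len, seq_len)):
            depth[i] += 1
    return depth

depth = depth_per_base(10, read_starts=[0, 2, 2, 5], read_len=4)
print(depth)   # [1, 1, 3, 3, 2, 3, 1, 1, 1, 0]
```

Coverage that rises toward the middle of a variant assembly and falls away at the edges is the pattern to look for when judging how well the reads support it.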

In [None]:
# Let's plot the coverage for each individual

%matplotlib inline
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(14,5))

# Plot one individual; uncomment the others to overlay them
ax.plot(coverage77, color='orange', linewidth=2, label='NA12877')
# ax.plot(coverage78, color='lightgreen', linewidth=2, label='NA12878')
# ax.plot(coverage79, color='darkgray', linewidth=2, label='NA12879')
# ax.plot(coverage80, color='lightblue', linewidth=2, label='NA12880')

ax.set_xlabel('Position')
ax.set_ylabel('Coverage')
ax.set_title('Read coverage of variant assembly by position')
ax.grid(True)
ax.legend()

plt.show()

## Huge EXTRA CREDIT. Analogous to the Baylor work, demonstrate using these tools to crisply identify break ends and the exact sequence of SV's sloppily identified by variant callers.

## Demonstration of Querying for a sequence in a set of genomes -- In our entire cohort, which genomes have the insertion/deletion? At what position relative to reference 37 does it occur? 

In [None]:
# ROB -- HELP!

# Now, for a single SNP, let's examine the allele breakdown for all individuals in the cohort
snp = snps[0]
for name, sample in samples.iteritems():
    if name[0].islower():
        continue # Skip references
    out = "%12s: " % (name,)
    for allele in snp[4].split('/'):
        out += "%s=%2d " % (
            allele, 
            coverage(bg, ref38, sample, snp, allele)
        )
    print out

## Demonstration of Querying for a sequence in a set of genomes -- Given a table of SNPs and SVs, which genomes in our entire cohort have certain variants? At what position relative to reference 37 does each variant occur? How large is it? See the SV examples above. Ideally the input to this would be an SQLite table of SNPs and SVs and the output would be a table relating variants and individuals to frequencies of occurrence.
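
One possible shape for the SQLite input and output described above is sketched here with Python's standard `sqlite3` module. The schema, table names, and rows are hypothetical placeholders, not real results. In practice the `observations` table would be filled from BioGraph coverage queries rather than typed in by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variants (
        variant_id TEXT PRIMARY KEY, chrom TEXT, pos INTEGER, kind TEXT, size INTEGER
    );
    CREATE TABLE observations (variant_id TEXT, sample TEXT, present INTEGER);
""")

# Placeholder input table of SNPs and SVs
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", [
    ("v1", "1", 100, "snp", 1),
    ("v2", "2", 200, "deletion", 50),
])

# Placeholder per-sample results (1 = variant seen in that sample)
conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", [
    ("v1", "s1", 1), ("v1", "s2", 1), ("v1", "s3", 0), ("v1", "s4", 1),
    ("v2", "s1", 0), ("v2", "s2", 1), ("v2", "s3", 0), ("v2", "s4", 0),
])

# Output: one row per variant with its position, size, and cohort frequency
rows = conn.execute("""
    SELECT v.variant_id, v.chrom, v.pos, v.kind, v.size, AVG(o.present) AS frequency
    FROM variants v
    JOIN observations o ON o.variant_id = v.variant_id
    GROUP BY v.variant_id
    ORDER BY v.variant_id
""").fetchall()

for row in rows:
    print(row)
```

Keeping the per-sample observations in their own table means the same data answers both questions in the heading: which genomes carry a variant, and how common it is across the cohort.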

## Demonstration of Querying for a sequence in a set of genomes -- In a specified region, for the entire cohort, what variations occur with what frequencies? (Result is a histogram) In this case, instead of looking for a specific table of SNPs or variants we're comparing the individuals to a reference.

## Huge EXTRA CREDIT. Do the above, but in addition to, or instead of, a reference genome, compare the individuals to a set of healthy individuals. Related EXTRA CREDIT: Look up a variant that may be a rare de novo variant in a table of healthy or "normal" individuals to determine if it is de novo or rare.

## Demonstrate that we can change the reference genome. Demonstrate that an SV based on human reference 37 in one of the genome samples is not an SV in human reference 38 (because the SV is included in the reference). This is like the example below.

In [None]:
"""
    Switch between multiple references.
"""
from biograph import new_graph, reference, find_variants, visualize

# Import data
bg = new_graph("/mnt/NA12878_S1.gbwt")

# Import two references
grch37 = reference("/reference/human_g1k_v37/")
grch38 = reference("/reference/homo_sapiens_GCA_000001405.19_GRCh38.p4/")

# Find variants for both reference coordinate systems
grch37_coords = find_variants(bg, grch37, "1", 245822567, 245824567)

# The equivalent locus in GRCh38
# http://www.ncbi.nlm.nih.gov/nuccore/KI270759.1
grch38_coords = find_variants(bg, grch38, "KI270759.1", 356442, 360442)

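# Let's see GRCh37
for v in grch37_coords:
    visualize(v.variants, v.coverage, v.ref_range)

# And the same locus in GRCh38
for v in grch38_coords:
    visualize(v.variants, v.coverage, v.ref_range)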

## Huge EXTRA CREDIT. Illustrate comparison of two sets of data to find alleles that are common in one and uncommon in the other, e.g. early-onset Alzheimer's vs. others

<img src="Demo1Slide9.JPG">

<img src="Demo1Slide10.JPG">

<img src="Demo1Slide11.JPG">

<img src="Demo1Slide12.JPG">

<img src="Demo1Slide13.JPG">

<img src="Demo1Slide14.JPG">

<img src="Demo1Slide15.JPG">

<img src="Demo1Slide17.JPG">