# Visualizing bioinformatics data with plot.ly

This notebook is used for visualising the quality scores of each sample.
It generates one interactive graph per sample, covering all the sequences generated from that sample.

https://plot.ly/~johnchase/22/visualizing-bioinformatics-data-with-plo/

In [8]:
!conda search colorlover

Fetching package metadata ...........

PackageNotFoundError: Packages missing in current channels:
            
  - colorlover

We have searched for the packages in the following channels:
            
  - https://repo.continuum.io/pkgs/main/osx-64
  - https://repo.continuum.io/pkgs/main/noarch
  - https://repo.continuum.io/pkgs/free/osx-64
  - https://repo.continuum.io/pkgs/free/noarch
  - https://repo.continuum.io/pkgs/r/osx-64
  - https://repo.continuum.io/pkgs/r/noarch
  - https://repo.continuum.io/pkgs/pro/osx-64
  - https://repo.continuum.io/pkgs/pro/noarch
            



In [None]:
# Obtain files

In [1]:
!wget https://github.com/johnchase/plotly-notebook/raw/master/raw_data.tar.gz

--2017-12-14 17:01:35--  https://github.com/johnchase/plotly-notebook/raw/master/raw_data.tar.gz
Resolving github.com... 192.30.255.112, 192.30.255.113
Connecting to github.com|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/johnchase/plotly-notebook/master/raw_data.tar.gz [following]
--2017-12-14 17:01:37--  https://raw.githubusercontent.com/johnchase/plotly-notebook/master/raw_data.tar.gz
Resolving raw.githubusercontent.com... 151.101.104.133
Connecting to raw.githubusercontent.com|151.101.104.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335320 (2.2M) [application/octet-stream]
Saving to: 'raw_data.tar.gz'


2017-12-14 17:01:40 (1.36 MB/s) - 'raw_data.tar.gz' saved [2335320/2335320]



In [2]:
!tar -xvzf raw_data.tar.gz

x ./._msa10.fna
x msa10.fna
x ./._seqs_quals.fastq
x seqs_quals.fastq


In [None]:
#load libraries

In [23]:
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as FF  # plotly.tools.FigureFactory is deprecated
import colorlover as cl

import skbio
from skbio.alignment import global_pairwise_align_nucleotide
from skbio.sequence import DNA
import pandas as pd
import itertools
import numpy as np

In [48]:
import plotly 
plotly.tools.set_credentials_file(username='siobhonegan', api_key='YOUR_API_KEY')  # placeholder: never publish a real API key

In [49]:
py.sign_in('siobhonegan', '')

## 1. Sequence Quality

Because the quality of sequence data produced by high-throughput sequencing varies between sequencing runs and samples, it is important to look at the sequence quality and, where it is low, filter or trim sequences or remove samples. A more detailed description of the fastq format and quality scores can be found in the scikit-bio documentation [http://scikit-bio.org/docs/latest/generated/skbio.io.format.fastq.html?highlight=fastq#module-skbio.io.format.fastq]. Quality scores themselves are difficult to interpret in their encoded form, so we will use scikit-bio to decode the scores for us. Here we load the sequence data into a generator with scikit-bio.

In [59]:
f = '319ITF-B16_S26_L001_R1_001.fastq'
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')

We can view one of the skbio.sequence._sequence.Sequence entries in the generator object. This will display the sequence data and associated metadata.

In [60]:
seq1 = next(seqs)
seq1

Sequence
---------------------------------------------------------------------
Metadata:
    'description': '1:N:0:AAGAGGCA+TCTGCATA'
    'id': 'M03542:139:000000000-AYK6Y:1:1101:18775:1757'
Positional metadata:
    'quality': <dtype: uint8>
Stats:
    length: 250
---------------------------------------------------------------------
0   TGAGTTTGAT CCTGGCTCAG AACGAACGCT ATCGGTATGC TTAACACATG CAAGTCGAAC
60  GGTCTAATTG GGTCTTGCTC CATTTATTTA GTGGCAGACG GGTGAGTAAC ATGTGGGTAT
120 CTACCCATCT GTACTGAATA ACTTTTAGAA ATAAAAGCTA ATACCGTATA TTCTCTACGT
180 AGGAAAGATT TATCGCTGTT GGTTGAGCCC GCGTCTGATT AGGTAGTTGG TGAGGTAATG
240 GCTCACCAAG
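To make the decoding step concrete, here is a minimal sketch of what turning a quality string into numeric scores involves. It assumes the Phred+33 offset used by the `illumina1.8` variant; the function names and the example string are made up for illustration, and this is not scikit-bio's own implementation.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 encoded quality string to a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Hypothetical quality string (not taken from the data set used here).
scores = decode_phred33("II5+#")
print(scores)
print([error_probability(q) for q in scores])
```

A score of 20 corresponds to a 1 in 100 chance of an incorrect base call, which is why 20 is a common cut-off for poor quality.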

Looking through sequence quality on a per-sequence basis is tedious, and it would be difficult to decipher meaningful patterns; plotting the data is a better solution. To create a meaningful plot we will first create a pd.DataFrame object of the quality scores. Due to limitations in data size, and because a subset of our data will fairly accurately represent the quality of the full data set, we will only look at the first 500 sequences in our data set.

In [61]:
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')

df = pd.DataFrame()
num_sequences = 500

for count, seq in enumerate(itertools.islice(seqs, num_sequences)):
    df[count] = seq.positional_metadata.quality
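Because each sequence is stored as a column, each row of the dataframe holds the scores for one base position across all sequences. The sketch below shows the same reshaping with plain Python lists. The scores are made-up values, not taken from the data set, and the variable names are only illustrative.

```python
from statistics import mean, median

# Hypothetical quality scores for three reads of length four (made-up values).
reads = [
    [38, 37, 30, 12],
    [40, 36, 28, 15],
    [39, 35, 25, 9],
]

# zip(*reads) transposes reads-by-position into position-by-reads,
# which mirrors one dataframe row per base position.
per_position = [list(column) for column in zip(*reads)]

# Summarise each base position across all reads.
summary = [(mean(col), median(col)) for col in per_position]
for position, (avg, med) in enumerate(summary):
    print(position, avg, med)
```

Each entry of `per_position` plays the role of one box in the boxplot that follows.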

Now that we have a dataframe with all of our quality scores, it is easy to visualize them with plotly.
We can improve upon a basic boxplot by defining a specific color scheme. Fastq quality scores range from 0 to 40, and poor quality is often considered to be anything less than 20. Given this, we will define a diverging colormap where an average quality score below 20 will be a shade of red, anything above 20 will be a shade of blue, and 20 itself will be yellow. This will help with distinguishing regions of high versus low quality.

First, define the colormap using colorlover.

In [62]:
purd = cl.scales['11']['div']['RdYlBu']
purd40 = cl.interp(purd, 40)
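As a rough illustration of what a diverging colormap does, the sketch below maps a score to a color by linear interpolation between three anchor colors. This is an assumption-laden stand-in, not how colorlover computes its scales: the function names, the RGB anchors, and the choice of linear interpolation in RGB space are all picked for illustration.

```python
def lerp_color(c0, c1, t):
    """Linearly interpolate between two RGB tuples, with t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def quality_to_rgb(score, low=0, mid=20, high=40):
    """Map a quality score onto a red -> yellow -> blue diverging scale."""
    # Illustrative anchor colors (assumed, not read from colorlover).
    red, yellow, blue = (215, 48, 39), (255, 255, 191), (69, 117, 180)
    # Clamp so that out-of-range scores still map to a valid color.
    score = max(low, min(high, score))
    if score <= mid:
        r, g, b = lerp_color(red, yellow, (score - low) / (mid - low))
    else:
        r, g, b = lerp_color(yellow, blue, (score - mid) / (high - mid))
    return "rgb({}, {}, {})".format(r, g, b)


print(quality_to_rgb(5), quality_to_rgb(20), quality_to_rgb(38))
```

Clamping matters because a mean score at or above the top of the range would otherwise fall outside the scale.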

Now we can make boxplots of the quality scores on a per-base basis.

In [63]:
traces = []
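for e in range(len(df)):
    # Clamp the index: a rounded mean of 40 or more would fall outside the 40-color palette
    color_index = min(int(round(df.iloc[e].mean(), 0)), len(purd40) - 1)
    traces.append(go.Box(
        y=df.iloc[e].values,
        name=e,
        boxpoints='none',
        whiskerwidth=0.2,
        marker=dict(
            size=.1,
            color=purd40[color_index]
        ),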
        line=dict(width=1),
    ))

layout = go.Layout(
    title='Quality Score Distributions',
    yaxis=dict(
        title='Quality Score',
        autorange=True,
        showgrid=True,
        zeroline=True,
        gridcolor='#d9d4d3',
        zerolinecolor='#d9d4d3',
    ),
    xaxis=dict(
        title='Base Position',
    ),

    font=dict(family='Times New Roman', size=16, color='#2e1c18'),
    paper_bgcolor='#eCe9e9',
    plot_bgcolor='#eCe9e9'
)

fig = go.Figure(data=traces, layout=layout)
py.iplot(fig, filename='quality-scores')

High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~siobhonegan/0 or inside your plot.ly account where it is named 'quality-scores'


## 2. Sequence Alignment

One of the most important steps in bioinformatics is sequence alignment. Aligning sequences helps us to understand the relationship between two or more sequences: it allows us to identify specific bases or regions that vary between sequences, and it can be used to discover novel sequences. For the purpose of this notebook we will use the global pairwise aligner from skbio, which is known to be slow and should be updated soon. If you are aligning more than a few sequences, other, faster aligners such as MAFFT would be preferable. QIIME2 provides a convenient set of tools that wrap aligners such as MAFFT.

### Align the first two sequences using scikit-bio

This is slow, and really only appropriate for educational purposes. In fact, scikit-bio will generate a warning to this effect. You can find more information about the scikit-bio aligner in this GitHub issue: https://github.com/biocore/scikit-bio/issues/254. If you need local rather than global alignment, scikit-bio has optimized algorithms appropriate for larger-scale data.

Once again we load the sequences using scikit-bio. Here we will only load the first two sequences in order to illustrate pairwise alignment.

In [64]:
seqs = [DNA(e) for e in itertools.islice(skbio.io.read(f, format='fastq', variant='illumina1.8'), 2)]

In [65]:
aligned_seqs = global_pairwise_align_nucleotide(seqs[0], seqs[1])
aligned_seqs


You're using skbio's python implementation of Needleman-Wunsch alignment. This is known to be very slow (e.g., thousands of times slower than a native C implementation). We'll be adding a faster version soon (see https://github.com/biocore/scikit-bio/issues/254 to track progress on this).



(TabularMSA[DNA]
 -----------------------------------------------------------------------
 Stats:
     sequence count: 2
     position count: 499
 -----------------------------------------------------------------------
 TGAGTTTGATCCTGGCTCAGAACGAACGCTATC ... ---------------------------------
 --------------------------------- ... GCTCTAGGATTAGCCTACGTCGGATTTGCTAGT,
 2.0,
 [(0, 249), (0, 250)])
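For readers curious about what a Needleman-Wunsch aligner computes, the following is a minimal sketch of the dynamic programming recurrence that returns only the optimal global alignment score. It assumes a simple linear gap penalty and made-up match, mismatch and gap values, so it will not reproduce the scores from scikit-bio, which uses its own scoring defaults. The function name is hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b."""
    # previous[j] is the best score for aligning the prefix of a seen so far
    # (initially the empty prefix) with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diagonal = previous[j - 1] + (match if same else mismatch)
            up = previous[j] + gap       # a[i-1] is aligned to a gap
            left = current[j - 1] + gap  # b[j-1] is aligned to a gap
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

The nested loops visit every pair of positions, so the running time grows with the product of the two sequence lengths. This is part of why a pure Python implementation is slow on long reads.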

### Align multiple sequences

Pairwise alignment is a simpler task than multiple sequence alignment. With multiple sequences, each additional sequence changes the overall alignment, meaning it must be constructed progressively with each additional sequence. This can be very computationally expensive. If you do not wish to wait for the alignment to run (or do not wish to install An Introduction to Bioinformatics), you can leave the following lines commented out and load the pre-aligned sequences.

In [32]:
# from iab.algorithms import progressive_msa, tree_from_distance_matrix
# from functools import partial

# f = 'run1_16s/rev_seqs/1AM1JR7QWMSFA.fastq'
# seqs = [DNA(e) for e in itertools.islice(skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8'), 10)]
# seqs = [e for e in seqs if not e.has_degenerates()]
# msa = progressive_msa(seqs, global_pairwise_align_nucleotide)
# msa.write('msa10.fna')
msa = skbio.alignment.TabularMSA.read('msa10.fna', constructor=DNA)

In order to create a meaningful visualization of the multiple sequence alignment we will use the heatmap functionality of plotly.
First we assign a numeric value to each possible character in our alignment: "A", "T", "G", "C", and "-".

In [33]:
base_dic = {'A': 1, 'C': .25, 'G': .5, 'T': .75, '-': 0}

Next we define a function that takes an alignment and returns two two-dimensional arrays: one of the characters in the alignment, and one of the numeric values that represent those characters, as defined in the dictionary above. The numeric value will be used to define the color in the heatmap. The function below builds the array so that the only bases colored differently are those that differ from the first sequence. This will make identifying differences between sequences much easier. If each base were given a unique color regardless of its relationship to the other sequences, the plot would be noisy and difficult to interpret.

In [34]:
def seq_align_for_plot(msa):
    base_text = [list(str(e)) for e in msa]
    base_values = np.zeros((len(base_text), len(base_text[0])))
    for i in range(len(base_text[0])):
        for j in range(len(base_text)):
            if base_text[j][i] != base_text[0][i]:
                base_values[j][i] = base_dic[base_text[j][i]]
    return base_text, base_values

base_text, base_values = seq_align_for_plot(msa)
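To see what the function produces, here is a dependency-free sketch of the same masking idea applied to a tiny alignment. The three sequences are made up for illustration and the variable names are hypothetical. The encoding is repeated from the dictionary above so that the example stands alone.

```python
# Same encoding as base_dic above, repeated so the sketch is self-contained.
base_values_by_char = {'A': 1, 'C': .25, 'G': .5, 'T': .75, '-': 0}

# Hypothetical three-sequence alignment (made-up sequences).
toy_alignment = ["ACGT-", "ACTT-", "A-GTA"]

# The first sequence acts as the reference that all others are compared to.
reference = toy_alignment[0]

# Positions matching the reference stay 0; mismatches take the base's value.
toy_values = [
    [base_values_by_char[ch] if ch != ref else 0
     for ch, ref in zip(seq, reference)]
    for seq in toy_alignment
]

for seq, row in zip(toy_alignment, toy_values):
    print(seq, row)
```

One consequence of this encoding is worth noting: a gap that differs from the reference also gets the value 0, so it shares the background color and is visible only through its annotation text.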

Define a colorscale in which the value for each base is assigned a specific color.

In [35]:
colorscale=[[0.00, '#F4F0E4'], 
            [0.25, '#1b9e77'], 
            [0.50, '#d95f02'], 
            [0.75, '#7570b3'],
            [1.00, '#e7298a']]
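Each numeric code in the dictionary sits exactly on one stop of this colorscale, so no interpolation between colors is needed. The short sketch below makes the pairing explicit. Both structures are repeated so the example stands alone, and `stops` and `color_for_base` are names introduced only for illustration. The pairing assumes the plotted values span the full range from 0 to 1, since a colorscale is applied relative to the range of the data being plotted.

```python
base_dic = {'A': 1, 'C': .25, 'G': .5, 'T': .75, '-': 0}
colorscale = [[0.00, '#F4F0E4'],
              [0.25, '#1b9e77'],
              [0.50, '#d95f02'],
              [0.75, '#7570b3'],
              [1.00, '#e7298a']]

# Turn the list of [stop, color] pairs into a lookup table keyed by stop value.
stops = {stop: color for stop, color in colorscale}

# Pair every alignment character with the color at its numeric code.
color_for_base = {base: stops[value] for base, value in base_dic.items()}

for base, color in sorted(color_for_base.items()):
    print(base, color)
```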

Create a list of arbitrary sequence names (the original sequence names were randomly generated and do not have meaning associated with them).

In [36]:
seq_names = ["Seq " + str(e + 1) for e in range(len(base_text))]

Finally we can plot the alignment.


In [53]:
fig = FF.create_annotated_heatmap(base_values, 
                                  annotation_text=base_text, 
                                  colorscale=colorscale)

fig['layout'].update(
    title="Aligned Sequences",
    xaxis=dict(ticks='', 
               side='top',
               ticktext=list(np.arange(0, len(base_text[0]), 10)),
               tickvals=list(np.arange(0, len(base_text[0]), 10)),
               showticklabels=True,
               tickfont=dict(family='Bookman', 
                             size=18, 
                             color='#22293B',
                            ),
              ),
    
    yaxis=dict(autorange='reversed',
               ticks='', 
               ticksuffix='  ',
               ticktext=seq_names,
               tickvals=list(np.arange(0, len(base_text))),
               showticklabels=True,
               tickfont=dict(family='Bookman', 
                         size=18, 
                         color='#22293B',
                            ),
              ),
    width=10000,
    height=450,
    autosize=True,
    annotations=dict(font=dict(family='Courier New, monospace',
                                size=14,
                                color='#3f566d'
                               ),
                      )
)
py.iplot(fig, filename='msa')


plotly.tools.FigureFactory.create_annotated_heatmap is deprecated. Use plotly.figure_factory.create_annotated_heatmap

