# Visualizing bioinformatics data with plot.ly

This notebook visualizes the sequence quality scores of each sample.
It generates one interactive graph per sample, summarizing the sequences generated from that sample.

URL https://plot.ly/~johnchase/22/visualizing-bioinformatics-data-with-plo/

In [8]:
!conda search colorlover

Fetching package metadata ...........

PackageNotFoundError: Packages missing in current channels:
            
  - colorlover

We have searched for the packages in the following channels:
            
  - https://repo.continuum.io/pkgs/main/osx-64
  - https://repo.continuum.io/pkgs/main/noarch
  - https://repo.continuum.io/pkgs/free/osx-64
  - https://repo.continuum.io/pkgs/free/noarch
  - https://repo.continuum.io/pkgs/r/osx-64
  - https://repo.continuum.io/pkgs/r/noarch
  - https://repo.continuum.io/pkgs/pro/osx-64
  - https://repo.continuum.io/pkgs/pro/noarch
            



In [None]:
# Obtain files

In [1]:
!wget https://github.com/johnchase/plotly-notebook/raw/master/raw_data.tar.gz

--2017-12-14 17:01:35--  https://github.com/johnchase/plotly-notebook/raw/master/raw_data.tar.gz
Resolving github.com... 192.30.255.112, 192.30.255.113
Connecting to github.com|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/johnchase/plotly-notebook/master/raw_data.tar.gz [following]
--2017-12-14 17:01:37--  https://raw.githubusercontent.com/johnchase/plotly-notebook/master/raw_data.tar.gz
Resolving raw.githubusercontent.com... 151.101.104.133
Connecting to raw.githubusercontent.com|151.101.104.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335320 (2.2M) [application/octet-stream]
Saving to: 'raw_data.tar.gz'


2017-12-14 17:01:40 (1.36 MB/s) - 'raw_data.tar.gz' saved [2335320/2335320]



In [2]:
!tar -xvzf raw_data.tar.gz

x ./._msa10.fna
x msa10.fna
x ./._seqs_quals.fastq
x seqs_quals.fastq


In [None]:
#load libraries

In [23]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF
import colorlover as cl

import skbio
from skbio.alignment import global_pairwise_align_nucleotide
from skbio.sequence import DNA
import pandas as pd
import itertools
import numpy as np

In [48]:
import plotly 
# Replace the placeholder with your own key; avoid committing real credentials to a shared notebook
plotly.tools.set_credentials_file(username='siobhonegan', api_key='YOUR_API_KEY')

In [49]:
py.sign_in('siobhonegan', '')

## 1. Sequence Quality

Because the quality of sequence data produced by high-throughput sequencing varies between sequencing runs and samples, it is important to inspect the sequence quality and, where it is low, filter or trim sequences or remove whole samples. A more detailed description of the fastq format and quality scores can be found in the [scikit-bio fastq documentation](http://scikit-bio.org/docs/latest/generated/skbio.io.format.fastq.html?highlight=fastq#module-skbio.io.format.fastq). The encoded quality scores are difficult to read directly, so we will use scikit-bio to decode them for us. Here we load the sequence data into a generator with scikit-bio.

In [59]:
f = '319ITF-B16_S26_L001_R1_001.fastq'
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')
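
For reference, the decoding step can be sketched by hand. This is a minimal illustration, not scikit-bio's implementation; it assumes the Phred+33 offset used by Illumina 1.8+ files, and the function names `decode_phred33` and `error_probability` are hypothetical helpers introduced only for this example.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer Phred scores (assumes offset 33)."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Made-up quality string for illustration only
scores = decode_phred33("II5+!")
print(scores)
print([error_probability(q) for q in scores])
```

A score of 20 corresponds to a 1 in 100 chance of an incorrect base call, which is why 20 is a common cutoff.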

We can view one of the skbio.sequence._sequence.Sequence entries in the generator object. This will display the sequence data and associated metadata.

In [60]:
seq1 = next(seqs)
seq1

Sequence
---------------------------------------------------------------------
Metadata:
    'description': '1:N:0:AAGAGGCA+TCTGCATA'
    'id': 'M03542:139:000000000-AYK6Y:1:1101:18775:1757'
Positional metadata:
    'quality': <dtype: uint8>
Stats:
    length: 250
---------------------------------------------------------------------
0   TGAGTTTGAT CCTGGCTCAG AACGAACGCT ATCGGTATGC TTAACACATG CAAGTCGAAC
60  GGTCTAATTG GGTCTTGCTC CATTTATTTA GTGGCAGACG GGTGAGTAAC ATGTGGGTAT
120 CTACCCATCT GTACTGAATA ACTTTTAGAA ATAAAAGCTA ATACCGTATA TTCTCTACGT
180 AGGAAAGATT TATCGCTGTT GGTTGAGCCC GCGTCTGATT AGGTAGTTGG TGAGGTAATG
240 GCTCACCAAG

Looking through sequence quality on a per-sequence basis is tedious, and it would be difficult to decipher meaningful patterns; plotting the data is a better solution. To create a meaningful plot we will first build a pd.DataFrame object of the quality scores. Because of limitations in data size, and because a subset of the data should represent the quality of the full data set fairly accurately, we will only look at the first 500 sequences.

In [61]:
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')

df = pd.DataFrame()
num_sequences = 500

for count, seq in enumerate(itertools.islice(seqs, num_sequences)):
    df[count] = seq.positional_metadata.quality
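
The resulting layout has one column per sequence and one row per base position, so each row collects the scores observed at that position across reads. A stdlib-only sketch of the same reshaping, using made-up scores (the `reads` values below are illustrative, not taken from the data set):

```python
# Three hypothetical reads, each with quality scores for three base positions
reads = [
    [38, 37, 20],
    [40, 35, 12],
    [36, 30, 10],
]

# Transpose: one tuple per base position, holding that position's score in every read
per_position = list(zip(*reads))

# Mean quality at each base position
per_position_mean = [sum(col) / len(col) for col in per_position]
print(per_position_mean)
```

Each entry of `per_position` plays the role of one row of the dataframe, which is what a single box in the plot will summarize.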

Now that we have a dataframe with all of our quality scores, it is easy to visualize them with plotly.
We can improve upon a basic boxplot by defining a specific color scheme. Fastq quality scores here range from 0 to 40, and anything less than 20 is often considered poor quality. Given this, we will define a diverging colormap in which an average quality score below 20 is a shade of red, a score of 20 is yellow, and anything above is a shade of blue. This helps distinguish regions of high and low quality.

First, define the colormap using colorlover:

In [62]:
purd = cl.scales['11']['div']['RdYlBu']
purd40 = cl.interp(purd, 40)
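
To make the score-to-color mapping concrete, here is a stdlib-only sketch of a diverging red-yellow-blue scale. It is not how colorlover interpolates (this version interpolates linearly in RGB), and the anchor colors, `lerp_color`, and `quality_to_rgb` are assumptions chosen for illustration.

```python
# Hypothetical anchor colors: red (low), yellow (middle), blue (high)
ANCHORS = [(215, 48, 39), (255, 255, 191), (69, 117, 180)]


def lerp_color(c0, c1, t):
    """Linearly interpolate between two RGB tuples, with t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def quality_to_rgb(score, max_score=40):
    """Map a quality score to an 'rgb(r, g, b)' string on the diverging scale."""
    score = min(max(score, 0), max_score)          # clamp out-of-range scores
    pos = score / max_score * (len(ANCHORS) - 1)   # position along the anchors
    i = min(int(pos), len(ANCHORS) - 2)            # index of the left anchor
    t = pos - i                                    # fraction toward the right anchor
    return "rgb({}, {}, {})".format(*lerp_color(ANCHORS[i], ANCHORS[i + 1], t))


for q in (0, 10, 20, 30, 40):
    print(q, quality_to_rgb(q))
```

Clamping the score before indexing keeps a value at or above the top of the range from falling off the end of the scale.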

Now we can make boxplots of the quality scores on a per-base basis:

In [63]:
traces = []
for e in range(len(df)):
    traces.append(go.Box(
        y=df.iloc[e].values,
        name=e,
        boxpoints='none',
        whiskerwidth=0.2,
        marker=dict(
            size=.1,
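            # Clamp the index: purd40 has 40 entries (0-39), so a mean of 40 or more would overflow
            color=purd40[min(int(round(df.iloc[e].mean())), len(purd40) - 1)]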
        ),
        line=dict(width=1),
    ))

layout = go.Layout(
    title='Quality Score Distributions',
    yaxis=dict(
        title='Quality Score',
        autorange=True,
        showgrid=True,
        zeroline=True,
        gridcolor='#d9d4d3',
        zerolinecolor='#d9d4d3',
    ),
    xaxis=dict(
        title='Base Position',
    ),

    font=dict(family='Times New Roman', size=16, color='#2e1c18'),
    paper_bgcolor='#eCe9e9',
    plot_bgcolor='#eCe9e9'
)

fig = go.Figure(data=traces, layout=layout)
py.iplot(fig, filename='quality-scores')

High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~siobhonegan/0 or inside your plot.ly account where it is named 'quality-scores'
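
Each box in the figure summarizes the distribution of scores at one base position. As a rough sketch of the numbers behind a single box, the snippet below computes a five-number summary with the standard library. Plotly computes its own quartiles and whiskers, so its values may differ slightly; `box_summary` and the sample scores are hypothetical.

```python
import statistics


def box_summary(values):
    """Return min, quartiles, and max for one base position's scores."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Made-up scores for a single base position across five reads
summary = box_summary([30, 32, 34, 36, 38])
print(summary)
```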
