# Lamprey Transcriptome Analysis

```
Camille Scott [camille dot scott dot w @gmail.com] [@camille_codon]

camillescott.github.io

Lab for Genomics, Evolution, and Development
Michigan State University
```

## About

This notebook is the entry point for the [Petromyzon marinus](http://nas.er.usgs.gov/queries/FactSheet.aspx?speciesID=836) (sea lamprey) de novo transcriptome analysis. It links to the other notebooks and contains the code that collects and formats the data they use. Run it before any of the other notebooks so that the data they require exists.

## Contents

1. [Transcript Analysis Notebook](petmar-transcripts.ipynb)
2. [Tissue Analysis Notebook](petmar-tissues.ipynb)
3. [Protein Analysis Notebook](petmar-proteins.ipynb)
4. [Taxonomic Analysis Notebook](petmar-taxonomy.ipynb)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from libs import *
%run -i common.ipy

** Using data resources found in ../resources.json
** Using config found in ../config.json


In [3]:
wdir()

u'../_work/'

### Databases

In [4]:
resources_df[resources_df.meta_type != 'sample']

| | access | condition | db_type | filename | flowcell | label | meta_type | paired | phred | q_type | strand | terms | tissue | url | size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Myxinidae_pep | remote_query | | prot | Myx.pep.all.fa | | | fasta_database | | | uniprot | | taxonomy:7762 | | | 1004 |
| braFlo_pep | remote_query | | prot | braFlo.pep.all.fa | | | fasta_database | | | uniprot | | organism:7739+AND+keyword:1185 | | | 28544 |
| danRer_pep | remote_file | | prot | danRer.pep.fa | | | fasta_database | | | | | | | ftp://ftp.ensembl.org/pub/release-75/fasta/dan... | 43153 |
| homSap_pep | remote_file | | prot | homSap.pep.fa | | | fasta_database | | | | | | | ftp://ftp.ensembl.org/pub/release-74/fasta/hom... | 104763 |
| lamp10 | remote_file | | nucl | lamp10.fasta | | | assembly | | | | | | | http://athyra.ged.msu.edu/~camille/lamprey/lam... | 715345 |
| musMus_pep | remote_file | | prot | musMus.pep.fa | | | fasta_database | | | | | | | ftp://ftp.ensembl.org/pub/release-75/fasta/mus... | 52165 |
| petMar2 | remote_file | | nucl | petMar2.fa | | | fasta_database | | | | | | | ftp://ftp.ensembl.org/pub/release-75/fasta/pet... | 25006 |
| petMar2_cdna | remote_file | | nucl | petMar2.cdna.fa | | | fasta_database | | | | | | | ftp://ftp.ensembl.org/pub/release-75/fasta/pet... | 11489 |
| petMar2_cds | remote_file | | nucl | petMar2.cds.fa | | | fasta_database | | | | | | | ftp://ftp.ensembl.org/pub/release-75/fasta/pet... | 11442 |
| petMar2_gtf | remote_file | | feature | petMar2.gtf | | | gtf_database | | | | | | | http://ftp.ensembl.org/pub/release-75/gtf/petr... | 0 |


### Samples

In [5]:
sample_df

| | access | condition | db_type | filename | flowcell | label | meta_type | paired | phred | q_type | strand | terms | tissue | url | tissue_c | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 614GMAAXX_1_1_pf | local_file | stg18 | nucl | 614GMAAXX_1_1_pf.qc.fq.gz | 614GM | embryo_stg18_614GM | sample | False | phred33 | | +/- | | embryo | | 1 | (0.996493656495, 0.998646674437, 0.702575950062) |
| 614GMAAXX_2_1_pf | local_file | stg20 | nucl | 614GMAAXX_2_1_pf.qc.fq.gz | 614GM | embryo_stg20_614GM | sample | False | phred33 | | +/- | | embryo | | 1 | (0.996493656495, 0.998646674437, 0.702575950062) |
| 614GMAAXX_3_1_pf | local_file | stg22a | nucl | 614GMAAXX_3_1_pf.qc.fq.gz | 614GM | embryo_stg22a_614GM | sample | False | phred33 | | +/- | | embryo | | 1 | (0.996493656495, 0.998646674437, 0.702575950062) |
| 614GMAAXX_4_1_pf | local_file | stg22b | nucl | 614GMAAXX_4_1_pf.qc.fq.gz | 614GM | embryo_stg22b_614GM | sample | False | phred33 | | +/- | | embryo | | 1 | (0.996493656495, 0.998646674437, 0.702575950062) |
| 614GMAAXX_6_1_pf | local_file | stg23 | nucl | 614GMAAXX_6_1_pf.qc.fq.gz | 614GM | embryo_stg23_614GM | sample | False | phred33 | | +/- | | embryo | | 1 | (0.996493656495, 0.998646674437, 0.702575950062) |
| 614GMAAXX_7_1_pf | local_file | stg24c1 | nucl | 614GMAAXX_7_1_pf.qc.fq.gz | 614GM | embryo_stg24c1_614GM | sample | False | phred33 | | +/- | | embryo | | 1 | (0.996493656495, 0.998646674437, 0.702575950062) |
| 614GMAAXX_8_1_pf | local_file | stg24c2 | nucl | 614GMAAXX_8_1_pf.qc.fq.gz | 614GM | embryo_stg24c2_614GM | sample | False | phred33 | | +/- | | embryo | | 1 | (0.996493656495, 0.998646674437, 0.702575950062) |
| AB057JABXX_s_7_pe | local_file | larval | nucl | AB057JABXX_s_7_pe.trim.fq.gz | AB057 | brain_larval_AB057 | sample | True | phred64 | | +/- | | brain | | 0 | (0.552941203117, 0.827450990677, 0.780392169952) |
| AI | local_file | adult | nucl | AI.fq.gz | 614GM | intest_adult_614GM | sample | False | phred33 | | +/- | | intest | | 5 | (0.991018839443, 0.706528275854, 0.384421383283) |
| AK | local_file | adult | nucl | AK.fq.gz | 614GM | kidney_adult_614GM | sample | False | phred33 | | +/- | | kidney | | 7 | (0.98486735961, 0.804705893993, 0.892318345168) |


## Data

Create an HDF5 store to hold the data needed by the other notebooks. We won't operate on the store directly, so it is compressed with zlib to save disk space. Note that the cell below uses `complevel=5`, a mid-range setting; zlib's maximum level is 9.

In [6]:
store = pd.HDFStore(wdir('{}.store.h5'.format(prefix)), complib='zlib', complevel=5)

In [7]:
import atexit
def exit_func():
    dump_results()
    store.close()
atexit.register(exit_func)

<function __main__.exit_func>

In [8]:
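# Index each assembly FASTA with screed; the call is made for its side effect
# of writing a screed database next to the file, so the return value is unused.
for fn in resources_df[resources_df.meta_type == 'assembly'].filename:
    screed.read_fasta_sequences(wdir(fn))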

### Transcript Support

In [9]:
tpm_df = pd.read_csv(wdir('lamp10.eXpress.tpm.tsv'), delimiter='\t', index_col=0)

In [10]:
labels = dict(zip(sample_df.filename, sample_df.label))

In [11]:
tpm_df.rename(columns=labels, inplace=True)

In [12]:
tpm_df.sort(axis=1, inplace=True)

In [13]:
store['lamp10.eXpress.tpm.tsv'] = tpm_df

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->axis0] [items->None]

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->block0_items] [items->None]



### Blast Results

In [14]:
%load_ext cython

In [15]:
%%cython
cimport numpy as np
import numpy as np
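# Convert one pair of 1-based, inclusive BLAST coordinate ranges to 0-based,
# half-open intervals, and record the strand of each range.
# Returns: 
# [0: sstart
#  1: send
#  2: qstart
#  3: qend
#  4: sstrand
#  5: qstrand]
cdef np.ndarray[long] fix_coords_single(long sstart, long send, long qstart, long qend):
    cdef np.ndarray[long] res = np.empty(6, dtype=long)
    
    if sstart < send:
        # Forward strand: bases sstart..send become [sstart - 1, send)
        res[0] = sstart - 1
        res[1] = send
        res[4] = 1
    else:
        # Reverse strand: bases send..sstart become [send - 1, sstart)
        res[0] = send - 1
        res[1] = sstart
        res[4] = -1
    
    if qstart < qend:
        res[2] = qstart - 1
        res[3] = qend
        res[5] = 1
    else:
        res[2] = qend - 1
        res[3] = qstart
        res[5] = -1
    
    return res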

cpdef np.ndarray[long, ndim=2] fix_blast_coords(np.ndarray[long] sstart, np.ndarray[long] send, 
                                                    np.ndarray[long] qstart, np.ndarray[long] qend):
    cdef long n = len(sstart)
    cdef long i = 0
    cdef np.ndarray[long, ndim=2] res = np.empty((n,6), dtype=long)
    for i in range(n):
        res[i,:] = fix_coords_single(sstart[i], send[i], qstart[i], qend[i])
    return res
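
The coordinate handling in the Cython cell can be illustrated in plain Python. This is a minimal sketch, assuming that BLAST tabular coordinates are 1-based and inclusive and that the goal is a 0-based, half-open interval plus a strand sign. `to_interval` is a hypothetical helper used only for illustration; it is not part of `blasttools`.

```python
def to_interval(start, end):
    """Convert 1-based inclusive BLAST coordinates to (begin, stop, strand).

    The result is a 0-based, half-open interval [begin, stop). The strand is
    1 when start < end and -1 otherwise (the hit runs in reverse).
    """
    if start < end:
        return start - 1, end, 1
    return end - 1, start, -1


# A forward hit covering bases 51..100, and the same bases reported in reverse.
forward = to_interval(51, 100)
reverse = to_interval(100, 51)
print(forward, reverse)
```

Under these assumptions both calls describe the same 50 bases, so the two intervals have identical bounds and differ only in the strand value.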

In [16]:
import blasttools

In [17]:
def fix_blast_coords_df(df):
    coords = fix_blast_coords(df.sstart.values, df.send.values, df.qstart.values, df.qend.values)
    df['sstart'] = coords[:,0]
    df['send'] = coords[:,1]
    df['qstart'] = coords[:,2]
    df['qend'] = coords[:,3]
    df['sstrand'] = coords[:,4]
    df['qstrand'] = coords[:,5]

In [18]:
blast_items = resources_df[~resources_df.meta_type.isin(['sample', 'assembly', 'gtf_database'])]
for i, (dbname, info) in enumerate(blast_items.iterrows()):
    target = '{}.fasta.x.{}.db.tsv'.format('lamp10', info['filename'])
    print dbname, target
    
    df = blasttools.blast_to_df(wdir(target))
    fix_blast_coords_df(df)

    store[target] = df

    tmp = pd.merge(pd.DataFrame(index=tpm_df.index), df,
                      left_index=True, right_index=True, how='left')
    blasttools.best_hits(tmp)

    if i == 0:
        lamp10_best_hits = pd.Panel({target: tmp})
    else:
        lamp10_best_hits[target] = tmp

Myxinidae_pep lamp10.fasta.x.Myx.pep.all.fa.db.tsv
braFlo_pep lamp10.fasta.x.braFlo.pep.all.fa.db.tsv
danRer_pep lamp10.fasta.x.danRer.pep.fa.db.tsv
homSap_pep lamp10.fasta.x.homSap.pep.fa.db.tsv
musMus_pep lamp10.fasta.x.musMus.pep.fa.db.tsv
petMar2 lamp10.fasta.x.petMar2.fa.db.tsv
petMar2_cdna lamp10.fasta.x.petMar2.cdna.fa.db.tsv
petMar2_cds lamp10.fasta.x.petMar2.cds.fa.db.tsv
petMar2_ncrna lamp10.fasta.x.petMar2.ncrna.fa.db.tsv
petMar2_pep lamp10.fasta.x.petMar2.pep.fa.db.tsv


In [19]:
store['lamp10_best_hits'] = lamp10_best_hits

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['lamp10.fasta.x.Myx.pep.all.fa.db.tsv', 'lamp10.fasta.x.braFlo.pep.all.fa.db.tsv', 'lamp10.fasta.x.danRer.pep.fa.db.tsv', 'lamp10.fasta.x.homSap.pep.fa.db.tsv', 'lamp10.fasta.x.musMus.pep.fa.db.tsv', 'lamp10.fasta.x.petMar2.fa.db.tsv', 'lamp10.fasta.x.petMar2.cdna.fa.db.tsv', 'lamp10.fasta.x.petMar2.cds.fa.db.tsv', 'lamp10.fasta.x.petMar2.ncrna.fa.db.tsv', 'lamp10.fasta.x.petMar2.pep.fa.db.tsv']]



In [20]:
blast_items = resources_df[~resources_df.meta_type.isin(['sample', 'assembly', 'gtf_database'])]
for i, (dbname, info) in enumerate(blast_items.iterrows()):
    A_fn = '{}.fasta.x.{}.db.tsv'.format('lamp10', info.filename)
    B_fn = '{}.db.x.{}.fasta.tsv'.format(info.filename, 'lamp10')
    
    print '{} <=> {}'.format(A_fn, B_fn)
    
    A = pd.read_table(wdir(A_fn), header=None, index_col=0, names=outfmt6)
    B = pd.read_table(wdir(B_fn), header=None, index_col=0, names=outfmt6)

    fix_blast_coords_df(A)
    fix_blast_coords_df(B)
    
    X = blasttools.get_orthologies(A, B, tpm_df.index)
    
    if i == 0:
        lamp10_ortho = pd.Panel({A_fn: X})
    else:
        lamp10_ortho[A_fn] = X

lamp10.fasta.x.Myx.pep.all.fa.db.tsv <=> Myx.pep.all.fa.db.x.lamp10.fasta.tsv
lamp10.fasta.x.braFlo.pep.all.fa.db.tsv <=> braFlo.pep.all.fa.db.x.lamp10.fasta.tsv
lamp10.fasta.x.danRer.pep.fa.db.tsv <=> danRer.pep.fa.db.x.lamp10.fasta.tsv
lamp10.fasta.x.homSap.pep.fa.db.tsv <=> homSap.pep.fa.db.x.lamp10.fasta.tsv
lamp10.fasta.x.musMus.pep.fa.db.tsv <=> musMus.pep.fa.db.x.lamp10.fasta.tsv
lamp10.fasta.x.petMar2.fa.db.tsv <=> petMar2.fa.db.x.lamp10.fasta.tsv
lamp10.fasta.x.petMar2.cdna.fa.db.tsv <=> petMar2.cdna.fa.db.x.lamp10.fasta.tsv
lamp10.fasta.x.petMar2.cds.fa.db.tsv <=> petMar2.cds.fa.db.x.lamp10.fasta.tsv
lamp10.fasta.x.petMar2.ncrna.fa.db.tsv <=> petMar2.ncrna.fa.db.x.lamp10.fasta.tsv
lamp10.fasta.x.petMar2.pep.fa.db.tsv <=> petMar2.pep.fa.db.x.lamp10.fasta.tsv


In [21]:
store['lamp10_ortho'] = lamp10_ortho
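
The paired A and B files above are BLAST runs in opposite directions (transcripts against a database, and that database against the transcripts). One common way to call putative orthologs from such a pair is reciprocal best hits. The sketch below shows that technique on toy data; whether `blasttools.get_orthologies` works exactly this way is an assumption, and `best_hits` and `reciprocal_best_hits` here are hypothetical stand-ins, not the library's functions.

```python
def best_hits(hits):
    """Keep the lowest-e-value subject for each query.

    `hits` is an iterable of (query, subject, evalue) tuples.
    """
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_hits, b_hits):
    """Return sorted (a_id, b_id) pairs that are each other's best hit."""
    a_best = best_hits(a_hits)  # transcript -> database sequence
    b_best = best_hits(b_hits)  # database sequence -> transcript
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Toy data: tr1 and protA choose each other; tr2's best hit prefers tr1.
a_hits = [("tr1", "protA", 1e-50), ("tr1", "protB", 1e-10), ("tr2", "protA", 1e-20)]
b_hits = [("protA", "tr1", 1e-48), ("protA", "tr2", 1e-18), ("protB", "tr1", 1e-9)]
pairs = reciprocal_best_hits(a_hits, b_hits)
print(pairs)
```

Only the pair where each sequence is the other's top hit survives, which is why `tr2` drops out even though it has a strong hit to `protA`.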

In [22]:
import glob

In [23]:
glob.glob(wdir('petMar2.cdna.fa.x*.tsv'))

[u'../_work/petMar2.cdna.fa.x.Myx.pep.all.fa.db.tsv',
 u'../_work/petMar2.cdna.fa.x.homSap.pep.fa.db.tsv',
 u'../_work/petMar2.cdna.fa.x.petMar2.pep.fa.db.tsv',
 u'../_work/petMar2.cdna.fa.x.danRer.pep.fa.db.tsv',
 u'../_work/petMar2.cdna.fa.x.musMus.pep.fa.db.tsv',
 u'../_work/petMar2.cdna.fa.x.petMar2.fa.db.tsv',
 u'../_work/petMar2.cdna.fa.x.braFlo.pep.all.fa.db.tsv']

In [24]:
petMar2_cdna_x_petMar2 = blasttools.blast_to_df(wdir('petMar2.cdna.fa.x.petMar2.fa.db.tsv'))
fix_blast_coords_df(petMar2_cdna_x_petMar2)
store['petMar2.cdna.fa.x.petMar2.fa.db.tsv'] = petMar2_cdna_x_petMar2



In [25]:
lamp10_blast_filter_df = lamp10_best_hits.minor_xs('evalue') >= 0

In [26]:
lamp10_ortho_filter_df = lamp10_ortho.minor_xs('evalue_x') >= 0

In [27]:
store['lamp10_blast_filter_df'] = lamp10_blast_filter_df

In [28]:
store['lamp10_ortho_filter_df'] = lamp10_ortho_filter_df
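
The `>= 0` comparisons above appear to act as "has a hit" masks: the best-hit tables were left-merged onto the full transcript index, so transcripts without a hit carry NaN, and any comparison against NaN is false. The following is a minimal sketch of that behavior using plain Python floats; the transcript names and e-values are made up for illustration.

```python
# Hypothetical e-values for three transcripts; NaN marks "no hit".
evalues = {"tr1": 1e-30, "tr2": float("nan"), "tr3": 0.0}

# Real e-values are never negative, so `>= 0` is True for every actual hit
# and False only where the value is NaN.
has_hit = {name: value >= 0 for name, value in evalues.items()}
print(has_hit)
```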

In [29]:
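# Count, per tissue, the samples in which each transcript has nonzero TPM.
# This relies on tpm_df's columns (sorted by label above) lining up
# positionally with sample_df sorted by label.
tissue_tr_df = (tpm_df > 0).groupby(by=sample_df.sort(columns='label').tissue.values, axis=1).sum()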

In [30]:
store['tissue_tr_df'] = tissue_tr_df
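
The `tissue_tr_df` table counts, for each transcript and tissue, how many samples show nonzero expression. The sketch below reproduces that idea with plain dictionaries on invented data; the sample labels, tissues, and TPM values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical TPM values: transcript -> {sample label: TPM}.
tpm = {
    "tr1": {"brain_a": 5.0, "embryo_a": 0.0, "embryo_b": 2.5},
    "tr2": {"brain_a": 0.0, "embryo_a": 1.0, "embryo_b": 3.0},
}
# Hypothetical mapping from sample label to tissue.
tissue_of = {"brain_a": "brain", "embryo_a": "embryo", "embryo_b": "embryo"}

# For each transcript, count the samples per tissue with TPM above zero.
counts = {}
for transcript, samples in tpm.items():
    per_tissue = defaultdict(int)
    for label, value in samples.items():
        per_tissue[tissue_of[label]] += int(value > 0)
    counts[transcript] = dict(per_tissue)

print(counts)
```

Keying the grouping on an explicit label-to-tissue mapping, as here, avoids depending on the column order of the expression table.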

In [31]:
store.close()