# Moore Foundation Site Visit
## Camille Scott: "What is it I do, anyway?"
### 2015-09-15
### UC Davis

In [3]:
from IPython.display import HTML

### 'Literate Bioinformatics'

* Data-processing logistics are crucial to understanding results in bioinformatics, yet they are rarely reported adequately. Pydoit makes them explicit!
  - http://pydoit.org/
  - https://github.com/dib-lab/2015-petMarSB <-- This is public!

In [2]:
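from doit.task import clean_targets
from doit.tools import title_with_actions

def samtools_sort_task(bam_fn):
    # Older samtools syntax: the second argument is an output prefix,
    # so this command writes {bam_fn}.sorted.bam.
    cmd = 'samtools sort -n {bam_fn} {bam_fn}.sorted'.format(**locals())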

    name = 'samtools_sort_{bam_fn}'.format(**locals())

    return {'name': name,
            'title': title_with_actions,
            'actions': [cmd],
            'targets': [bam_fn + '.sorted.bam'],
            'file_dep': [bam_fn],
            'clean': [clean_targets] }
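
In doit, a `task_*` function in `dodo.py` can yield several such dictionaries, one per sub-task, each distinguished by its `name`. The following is a minimal sketch of that pattern, not the project's actual `dodo.py`: the BAM file names and the `task_samtools_sort` generator are hypothetical, and the task dictionary is trimmed to the fields used for dependency tracking.

```python
# Hypothetical input files, for illustration only.
BAM_FILES = ['sample_a.bam', 'sample_b.bam']


def samtools_sort_task(bam_fn):
    """Build a doit task dictionary that name-sorts one BAM file."""
    cmd = 'samtools sort -n {0} {0}.sorted'.format(bam_fn)
    return {'name': 'samtools_sort_' + bam_fn,
            'actions': [cmd],
            'targets': [bam_fn + '.sorted.bam'],   # files the task produces
            'file_dep': [bam_fn]}                  # files the task reads


def task_samtools_sort():
    """doit collects functions named task_*; yielding dicts creates sub-tasks."""
    for bam_fn in BAM_FILES:
        yield samtools_sort_task(bam_fn)


if __name__ == '__main__':
    # Inspect the generated tasks without running doit itself.
    for task in task_samtools_sort():
        print(task['name'], '->', task['targets'][0])
```

Saved as `dodo.py`, running `doit` should skip any sub-task whose `file_dep` is unchanged and whose target already exists, so only stale steps are rerun.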

* Purpose-built services with limited scope are more useful and usable than monolithic ones. mygene for programmatic exploration!
    - http://mygene.info/

In [9]:
import mygene
mg = mygene.MyGeneInfo()

mg.query('HOX', fields='entrezgene,symbol', species='zebrafish', size=5, verbose=False)

{u'hits': [{u'_id': u'30341',
   u'_score': 0.8095548,
   u'entrezgene': 30341,
   u'symbol': u'hoxb6a'},
  {u'_id': u'30317',
   u'_score': 0.7813504,
   u'entrezgene': 30317,
   u'symbol': u'hoxb5a'},
  {u'_id': u'100006598',
   u'_score': 0.63142645,
   u'entrezgene': 100006598,
   u'symbol': u'hoxd12a'},
  {u'_id': u'30379',
   u'_score': 0.63142645,
   u'entrezgene': 30379,
   u'symbol': u'hoxc5a'},
  {u'_id': u'30404',
   u'_score': 0.5724417,
   u'entrezgene': 30404,
   u'symbol': u'hoxd10a'}],
 u'max_score': 0.8095548,
 u'took': 5,
 u'total': 9}
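
The result is a plain dictionary, so it can be reshaped with ordinary Python. A small sketch, using a copy of the first three hits shown above rather than a live query; the `symbol_by_entrez` helper is hypothetical and not part of mygene.

```python
# Copy of the first three hits from the query output above (not a live query).
result = {
    'hits': [
        {'_id': '30341', '_score': 0.8095548,
         'entrezgene': 30341, 'symbol': 'hoxb6a'},
        {'_id': '30317', '_score': 0.7813504,
         'entrezgene': 30317, 'symbol': 'hoxb5a'},
        {'_id': '100006598', '_score': 0.63142645,
         'entrezgene': 100006598, 'symbol': 'hoxd12a'},
    ],
    'max_score': 0.8095548,
}


def symbol_by_entrez(query_result):
    """Map each hit's Entrez gene ID to its gene symbol (hypothetical helper)."""
    return {hit['entrezgene']: hit['symbol'] for hit in query_result['hits']}


symbols = symbol_by_entrez(result)
print(sorted(symbols.values()))
```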

* Treating individual projects as iterative processes requires both of the points above. pweave connects the processing with the description and lets me crank the wheel faster.
    - https://github.com/dib-lab/2015-petMarSB/blob/master/doc/2015-petMarSB.draft.texw
    - https://github.com/dib-lab/2015-petMarSB/blob/master/doc/2015-petMarSB.draft.tex
    - https://github.com/dib-lab/2015-petMarSB/blob/master/doc/2015-petMarSB.draft.pdf

In [38]:
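import os

from doit.tools import run_once, title_with_actions

# clean_folder is not defined in this cell; it is assumed to be a helper
# defined elsewhere in the project.
def busco_task(input_filename, output_dir, busco_db_dir, input_type, busco_cfg):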
    
    name = '_'.join(['busco', input_filename, os.path.basename(busco_db_dir)])

    assert input_type in ['genome', 'OGS', 'trans']
    n_threads = busco_cfg['n_threads']
    busco_path = busco_cfg['path']

    cmd = 'python3 {busco_path} -in {in_fn} -o {out_dir} -l {db_dir} '\
            '-m {in_type} -c {n_threads}'.format(busco_path=busco_path, 
            in_fn=input_filename, out_dir=output_dir, db_dir=busco_db_dir, 
            in_type=input_type, n_threads=n_threads)

    return {'name': name,
            'title': title_with_actions,
            'actions': [cmd],
            'targets': ['run_' + output_dir, 
                        os.path.join('run_' + output_dir, 'short_summary_' + output_dir.rstrip('/'))],
            'file_dep': [input_filename],
            'uptodate': [run_once],
            'clean': [(clean_folder, ['run_' + output_dir])]}

In [22]:
dat = {"fn":{"0":"lamp10.fasta","1":"lamp10.fasta","2":"petMar2.cdna.fa","3":"petMar2.cdna.fa"},"db":{"0":"metazoa","1":"vertebrata","2":"metazoa","3":"vertebrata"},"C(%)":{"0":"66","1":"38","2":"48","3":"28"},"D(%)":{"0":"43","1":"23","2":"6.4","3":"2.0"},"F(%)":{"0":"27","1":"10","2":"15","3":"5.1"},"M(%)":{"0":"5.9","1":"50","2":"36","3":"66"},"n":{"0":"843","1":"3023","2":"843","3":"3023"}}
dat

{'C(%)': {'0': '66', '1': '38', '2': '48', '3': '28'},
 'D(%)': {'0': '43', '1': '23', '2': '6.4', '3': '2.0'},
 'F(%)': {'0': '27', '1': '10', '2': '15', '3': '5.1'},
 'M(%)': {'0': '5.9', '1': '50', '2': '36', '3': '66'},
 'db': {'0': 'metazoa', '1': 'vertebrata', '2': 'metazoa', '3': 'vertebrata'},
 'fn': {'0': 'lamp10.fasta',
  '1': 'lamp10.fasta',
  '2': 'petMar2.cdna.fa',
  '3': 'petMar2.cdna.fa'},
 'n': {'0': '843', '1': '3023', '2': '843', '3': '3023'}}

In [37]:
import buscotools as bt
import pandas as pd

df = pd.DataFrame(dat)
bt.to_latex(df)

u'\\begin{tabular}{lllllllllll}\n\\toprule\n{} & metazoa &      &      &      &      & vertebrata &      &      &      &       \\\\\n{} &    C(\\%) & D(\\%) & F(\\%) & M(\\%) &    n &       C(\\%) & D(\\%) & F(\\%) & M(\\%) &     n \\\\\n\\midrule\nfn              &         &      &      &      &      &            &      &      &      &       \\\\\nlamp10.fasta    &      66 &   43 &   27 &  5.9 &  843 &         38 &   23 &   10 &   50 &  3023 \\\\\npetMar2.cdna.fa &      48 &  6.4 &   15 &   36 &  843 &         28 &  2.0 &  5.1 &   66 &  3023 \\\\\n\\bottomrule\n\\end{tabular}\n'
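
`buscotools` is the author's own module and its source is not shown here. Judging only from the LaTeX output above, `bt.to_latex` appears to pivot the long table into one row per input file, with columns grouped by database. A rough pure-Python sketch of that reshaping, under that assumption; the `pivot_by_db` function is hypothetical and is not the `buscotools` implementation.

```python
# Same values as `dat` above, trimmed to two metrics to keep the sketch short.
dat = {
    'fn': {'0': 'lamp10.fasta', '1': 'lamp10.fasta',
           '2': 'petMar2.cdna.fa', '3': 'petMar2.cdna.fa'},
    'db': {'0': 'metazoa', '1': 'vertebrata',
           '2': 'metazoa', '3': 'vertebrata'},
    'C(%)': {'0': '66', '1': '38', '2': '48', '3': '28'},
    'n': {'0': '843', '1': '3023', '2': '843', '3': '3023'},
}


def pivot_by_db(columns):
    """Regroup column-oriented records into {fn: {db: {metric: value}}}."""
    metrics = [key for key in columns if key not in ('fn', 'db')]
    wide = {}
    for row_id, fn in columns['fn'].items():
        db = columns['db'][row_id]
        wide.setdefault(fn, {})[db] = {m: columns[m][row_id] for m in metrics}
    return wide


wide = pivot_by_db(dat)
print(wide['lamp10.fasta']['metazoa'])
```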

Putting it all together: we can construct complex analyses from these simple building blocks. This has been put into practice for the [lamprey transcriptome](http://nbviewer.ipython.org/github/dib-lab/2015-petMarSB/blob/master/notebooks/petmar-genome-completeness.ipynb).

![proposed craniate phylogenies](http://icb.oxfordjournals.org/content/50/1/130/F2.medium.gif)
