# Generating a report from an Illumina sequencing run

For all the Illumina sequencing we've been doing, I keep a Google spreadsheet where I record some metrics from each run, such as the date, the number of reads passing filter, the PhiX spike-in percentage and the read error rates. To do this, I've been manually searching for the data on BaseSpace and copying it into the spreadsheet. This process is both annoying and error-prone, as it's easy to mix up numbers when copying. The metrics are stored in the InterOp folder of the run directory, but they are all binary files, and until now I'd given up trying to parse them. Today I learned of the existence of the Illumina [interop](https://github.com/Illumina/interop) C++ library for parsing these files. More importantly, the library has a Python API, which makes it quite easy to pull information out of the InterOp files.

Here I show how to access some basic run metrics using the interop files from the run, and do something useful with this data.

## Example from Illumina tutorial

This example is mostly taken from the Python [tutorial](https://github.com/Illumina/interop/blob/master/docs/src/Tutorial_01_Intro.ipynb) available on Illumina's GitHub page.

In [1]:
from interop import py_interop_run_metrics, py_interop_run, py_interop_summary

In [23]:
import pandas as pd

In [61]:
import numpy as np

In [29]:
run_folder = '/Users/timstuart/Desktop/interop'

In [30]:
run_metrics = py_interop_run_metrics.run_metrics()

In [31]:
run_folder = run_metrics.read(run_folder, 0)
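Before calling `read`, it can save some confusion to confirm that the path really is a run directory. Here is a minimal sketch, assuming the run directory holds an `InterOp` subfolder (as described above) and a `RunInfo.xml` file (the latter is my assumption about the standard Illumina layout). `missing_run_files` is a hypothetical helper, not part of the interop library:

```python
from pathlib import Path


def missing_run_files(run_folder):
    """Return the expected entries that are absent from run_folder."""
    root = Path(run_folder)
    missing = []
    # The binary metric files live in the InterOp subfolder.
    if not (root / "InterOp").is_dir():
        missing.append("InterOp/")
    # Assumption: the run description sits alongside it as RunInfo.xml.
    if not (root / "RunInfo.xml").is_file():
        missing.append("RunInfo.xml")
    return missing
```

An empty list means both entries were found, so the folder at least looks like something `run_metrics.read` can work with.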

In [32]:
summary = py_interop_summary.run_summary()

In [33]:
py_interop_summary.summarize_run_metrics(run_metrics, summary)

In [36]:
columns = ( ('Yield Total (G)', 'yield_g'),
           ('Projected Yield (G)', 'projected_yield_g'),
           ('% Aligned', 'percent_aligned'))
rows = [('Non-Indexed Total', summary.nonindex_summary()),
        ('Total', summary.total_summary())]
d = []
for label, func in columns:
    d.append( (label, pd.Series([getattr(r[1], func)() for r in rows], index=[r[0] for r in rows])))

# DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order
pd.DataFrame(dict(d))

| | Yield Total (G) | Projected Yield (G) | % Aligned |
|---|---|---|---|
| Non-Indexed Total | 83.205124 | 83.205132 | 18.244617 |
| Total | 87.373581 | 87.373596 | 18.244617 |
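The cell above builds each column by looking up an accessor method by name with `getattr` and then calling it. To see that pattern in isolation, here is a small sketch that needs neither interop nor pandas. `FakeSummary` is a hypothetical stand-in for a summary object, and the numbers are made up purely for illustration:

```python
class FakeSummary:
    """Hypothetical stand-in exposing accessor methods like a summary object."""

    def __init__(self, yield_g, percent_aligned):
        self._yield_g = yield_g
        self._percent_aligned = percent_aligned

    def yield_g(self):
        return self._yield_g

    def percent_aligned(self):
        return self._percent_aligned


# (column label, accessor method name) pairs, as in the tutorial code.
demo_columns = (("Yield Total (G)", "yield_g"), ("% Aligned", "percent_aligned"))
# (row label, summary object) pairs; the values are invented for the demo.
demo_rows = [("Non-Indexed Total", FakeSummary(1.5, 10.0)),
             ("Total", FakeSummary(2.5, 10.0))]

# getattr(obj, name) fetches the bound method; the trailing () calls it.
table = {label: {row_name: getattr(obj, method)() for row_name, obj in demo_rows}
         for label, method in demo_columns}
```

Adding another metric is then just a matter of adding one more `(label, method name)` pair, which is why the tutorial code is organised this way.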


In [37]:
rows = [("Read %s%d"%("(I)" if summary.at(i).read().is_index()  else " ", summary.at(i).read().number()), summary.at(i).summary()) for i in range(summary.size())]
d = []
for label, func in columns:
    d.append( (label, pd.Series([getattr(r[1], func)() for r in rows], index=[r[0] for r in rows])))

# DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order
pd.DataFrame(dict(d))

| | Yield Total (G) | Projected Yield (G) | % Aligned |
|---|---|---|---|
| Read 1 | 14.889029 | 14.889029 | 19.393942 |
| Read (I)2 | 4.168461 | 4.168461 | 0.0 |
| Read 3 | 68.316093 | 68.316101 | 17.095293 |
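The row labels in that one-liner come from a `%`-format string: index reads get an `(I)` marker and all other reads get a single space in its place. Here is a sketch of just the labelling logic, pulled out into a hypothetical `read_label` helper so it can be tried without a run folder:

```python
def read_label(number, is_index):
    """Build a row label in the same format as the per-read summary cell."""
    marker = "(I)" if is_index else " "
    return "Read %s%d" % (marker, number)


# Read 2 is flagged as an index read here only to mirror the run above.
labels = [read_label(1, False), read_label(2, True), read_label(3, False)]
```

Note that the label for a non-index read contains two consecutive spaces, which matters if you later select rows by label.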


## Extracting total reads passed filter

First, get a summary by lane:

In [84]:
summary = py_interop_summary.index_lane_summary()

Setting the lane number to 0 will average over all the lanes (see the GitHub issue [here](https://github.com/Illumina/interop/issues/146#issuecomment-331475939)).

In [85]:
py_interop_summary.summarize_index_metrics(run_metrics,0,summary)
summary.total_fraction_mapped_reads()

75.96920013427734
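Since the whole point is to stop copying numbers by hand, a value like this can be written straight to a file that the spreadsheet imports. Below is a rough sketch using only the standard-library `csv` module. The `append_run_record` helper and the column names are hypothetical, so adjust them to whatever your spreadsheet uses:

```python
import csv
from pathlib import Path


def append_run_record(csv_path, record, fieldnames):
    """Append one run's metrics to a CSV file, writing the header if needed."""
    path = Path(csv_path)
    # Only write the header when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(record)
```

For this run, a call might look like `append_run_record("runs.csv", {"run": "my_run", "fraction_mapped": summary.total_fraction_mapped_reads()}, ["run", "fraction_mapped"])`, where the file name and keys are placeholders.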

## Extracting read 2 error 