# Cover usage examples

Cover objects can be used to extract genomics coverage data for fitting machine learning models.

How you use a Cover object depends on several factors, including whether it serves
input coverage tracks or output labels/signal, and whether you want base-pair resolution coverage
or an aggregate score describing each region.

In this tutorial, we shall go over some of the most important usage examples.

## Extract output labels/signal from a Cover object

In [1]:
from janggu.data import Cover
from janggu.data import ReduceDim

In [None]:
A common use case is to extract genomic feature tracks 
such that each interval is represented by an aggregate score.

This use case will involve feature extraction using Cover followed by reshaping
using the ReduceDim wrapper to generate 2D data structures.

In [None]:
First, we assume that the data has already been arranged in equally sized intervals.

In [None]:
Given a set of equally sized intervals, 
we want to extract a single score/label for each interval
and use these as training labels.
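
To illustrate what "one label per interval" means, here is a minimal pure-Python sketch that assigns a binary label to each interval depending on whether any feature overlaps it. It does not use janggu; the helper `label_intervals` and the example coordinates are hypothetical, and intervals are assumed to be half-open as in BED files.

```python
def label_intervals(intervals, features):
    """Return 1 for every interval overlapped by at least one feature, else 0.

    Both inputs are lists of (chrom, start, end) tuples with half-open
    coordinates. This is an illustrative helper, not part of janggu.
    """
    labels = []
    for chrom, start, end in intervals:
        hit = any(
            f_chrom == chrom and f_start < end and start < f_end
            for f_chrom, f_start, f_end in features
        )
        labels.append(1 if hit else 0)
    return labels


# Made-up, equally sized regions of interest and made-up peaks.
roi = [("chr1", 0, 500), ("chr1", 500, 1000), ("chr2", 0, 500)]
peaks = [("chr1", 450, 520), ("chr2", 600, 700)]

labels = label_intervals(roi, peaks)
print(labels)  # [1, 1, 0]
```

The peak on chr1 spans the boundary between the first two intervals, so both receive a positive label, while the chr2 peak lies outside the third interval.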

In [None]:
There are multiple ways to achieve this end.

In [None]:
We can extract the coverage tracks in three ways:

In [None]:
Option 1: Compatible with partial genome coverage;
    works with overlapping and non-overlapping intervals.

In [None]:
# File paths are left blank here; supply your own BED/bigWig/BAM and ROI files.
c1 = Cover.create_from_bed('signal', bedfiles=, roi=,
                           resolution=None)

c2 = Cover.create_from_bigwig('signal', bigwigfiles=, roi=,
                              resolution=None)

c3 = Cover.create_from_bam('signal', bamfiles=, roi=,
                           resolution=None)

In [None]:
c1.shape, c2.shape, c3.shape

In [None]:
#Option 2:

In [None]:
#c1 = Cover.create_from_bed('signal', bedfiles=, roi=,
#                           resolution=500)

#c2 = Cover.create_from_bigwig('signal', bigwigfiles=, roi=,
#                              resolution=500)

#c3 = Cover.create_from_bam('signal', bamfiles=, roi=,
#                           resolution=500)

In [None]:
#c1.shape, c2.shape, c3.shape

In [None]:
These operations yield coverage tracks with the shape (nbatch, 1, 1, ncondition).
To convert these signal tracks to a 2D table-like data structure, we can wrap the
coverage tracks using the ReduceDim class.
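
The shape change can be made concrete with a small NumPy sketch. This is only an illustration of the dimensions involved, not janggu's implementation; the array `cover_like`, its sizes, and the use of summation to collapse the axes are assumptions made for the example.

```python
import numpy as np

# Hypothetical stand-in for a Cover-like array: 4 intervals and 2 conditions,
# with one aggregated value per interval -> shape (nbatch, 1, 1, ncondition).
nbatch, ncondition = 4, 2
cover_like = np.arange(nbatch * ncondition, dtype="float32").reshape(
    nbatch, 1, 1, ncondition
)

# Collapsing the two singleton axes (here by summation, an arbitrary choice)
# yields a 2D table of shape (nbatch, ncondition).
table = cover_like.sum(axis=(1, 2))

print(cover_like.shape)  # (4, 1, 1, 2)
print(table.shape)       # (4, 2)
```

Because the two middle axes have length one, the values themselves are unchanged; only the layout becomes table-like, with one row per interval and one column per condition.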

In [None]:
rc1 = ReduceDim(c1)

In [None]:
print(rc1)

In [2]:
rc1.shape

In [None]:
If you want to work with whole-genome coverage tracks, there are two ways:

In [None]:
First, we can extract the coverage signal at base-pair resolution and subsequently apply the ReduceDim wrapper,
which aggregates the signal to a single score per interval.

In [None]:
c2 = ReduceDim(Cover.create_from_bed('signal', bedfiles=, roi=,
                                     resolution=1), aggregate='max')
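
As a rough picture of what a 'max' aggregation over base-pair resolution signal amounts to, consider the NumPy sketch below. It is not janggu code; the random `signal` array and its sizes are made up, and the assumed axis order is (interval, position, third axis, condition).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical base-pair resolution signal:
# 3 intervals, 200 positions each, a singleton third axis, 2 conditions.
signal = rng.poisson(lam=1.0, size=(3, 200, 1, 2)).astype("float32")

# Taking the maximum over the position axis and the third axis leaves
# one score per interval and condition.
labels = signal.max(axis=(1, 2))

print(signal.shape)  # (3, 200, 1, 2)
print(labels.shape)  # (3, 2)
```

Note that the full per-position array has to exist in memory before it is reduced.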

In [None]:
The downside of this approach is that
it may require quite a bit of memory, since the genome is extracted and stored in base-pair resolution.

In [None]:
A more memory-friendly alternative is to specify resolution=500.
In this case, the genome is binned into non-overlapping 500 bp regions
and only one value is stored per bin,
that is, the signal is stored at 500 bp resolution.
The prerequisite for this approach is that the intervals contained in the roi
are aligned with the bins.

In [None]:
c2 = ReduceDim(Cover.create_from_bed('signal', bedfiles=, roi=,
                                     resolution=500))
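
The binning idea and the alignment prerequisite can be sketched in plain NumPy as follows. The helpers `is_bin_aligned` and `bin_signal` are hypothetical and not part of janggu, and summation within each bin is an arbitrary choice for the illustration.

```python
import numpy as np

RESOLUTION = 500  # bin width in base pairs


def is_bin_aligned(start, end, resolution=RESOLUTION):
    """Check that an interval starts and ends exactly on bin boundaries."""
    return start % resolution == 0 and end % resolution == 0


def bin_signal(per_base, resolution=RESOLUTION, aggregate=np.sum):
    """Collapse a 1D per-base signal into non-overlapping bins.

    Trailing positions that do not fill a whole bin are dropped.
    """
    per_base = np.asarray(per_base)
    n_bins = len(per_base) // resolution
    trimmed = per_base[: n_bins * resolution]
    return aggregate(trimmed.reshape(n_bins, resolution), axis=1)


# A made-up 2000 bp stretch with uniform coverage of 1.
per_base = np.ones(2000, dtype="float32")
binned = bin_signal(per_base)

print(binned.shape)               # (4,)
print(is_bin_aligned(500, 1000))  # True
print(is_bin_aligned(250, 750))   # False
```

Storing four values instead of 2000 is where the memory saving comes from; an interval such as 250 to 750 straddles two bins, which is why the roi has to follow the bin grid.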

## Extract input coverage tracks from a Cover object

In [3]:
Of course, coverage tracks may also be supplied as input to a machine learning model,
for instance, to predict gene expression from histone modification ChIP-seq tracks.

In [None]:
In this situation, it is more common to extract the coverage at base-pair resolution
or to perform partial aggregation, rather than aggregating to a single summary score for each interval.

In [None]:
Base-pair resolution coverage

In [None]:
Partially aggregated coverage tracks
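
The difference between the two input layouts can be seen in a small NumPy sketch. It is an illustration only and does not call janggu; the array sizes, the 50 bp window, and the use of the mean within each window are assumptions chosen for the example.

```python
import numpy as np

nbatch, length, ncondition = 2, 1000, 3

# Hypothetical base-pair resolution input: one value per position,
# laid out as (interval, position, third axis, condition).
base_pair = np.arange(
    nbatch * length * ncondition, dtype="float32"
).reshape(nbatch, length, 1, ncondition)

# Partial aggregation: average within non-overlapping 50 bp windows,
# which keeps positional structure but shortens the position axis.
window = 50
partial = base_pair.reshape(
    nbatch, length // window, window, 1, ncondition
).mean(axis=2)

print(base_pair.shape)  # (2, 1000, 1, 3)
print(partial.shape)    # (2, 20, 1, 3)
```

Either array keeps a position axis, so a model such as a convolutional network can still exploit the spatial pattern of the signal, which a single summary score per interval would discard.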