# Dataset Tutorial
This notebook gives a short tutorial on how to use the PyTorch dataset I implemented.
You can find the code in `src/data/dataset.py` in the `CnvDataset` class.

Let's start by importing some packages we might need.

In [44]:
import pandas as pd
from pathlib import Path

# add this to your notebook so it automatically reloads code you changed in a
# Python file after importing it
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Now we also need to import the `CnvDataset` class.
Since the path (relative to the git repository root) to this notebook is `preprocessing/dataset_example.ipynb`, we need to add the parent directory to our system path in order to import software from there.
Think of it like this: We need to tell the notebook the relative path to the software folder `src` in order to import software from there.

In [45]:
import sys
sys.path.append('..') # add the parent directory to system path
from src.data.dataset import CnvDataset
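As a small variation, the path can be resolved to an absolute path before it is added, which is easier to debug and avoids duplicate entries when the cell is executed more than once. This is only a sketch; the name `repo_root` is made up for illustration.

```python
import sys
from pathlib import Path

# resolve '..' against the current working directory to get an absolute path
repo_root = Path('..').resolve()

# only add the path once, even if this cell is executed repeatedly
if str(repo_root) not in sys.path:
    sys.path.append(str(repo_root))

print(repo_root)
```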

Great :)
Now that this is out of the way, let's define some important paths for files we need to work with.

In [46]:
# directories we will need
git_root = Path('..')
data_root = git_root / 'data'
assert data_root.exists()

# dataset split files
b1_train_path = data_root / 'splits' / 'batch1_training_filtered.tsv'
b1_val_path = data_root / 'splits' / 'batch1_val_filtered.tsv'
b1_test_path = data_root / 'splits' / 'batch1_test_filtered.tsv'

Alright, now we are almost ready to use the `CnvDataset`.
The last thing that is missing is the path to the directory that stores the embedding files for the dataset we want to use.
All dataset paths follow the same pattern:
```
data/embeddings/batch_<batch_number>/<dataset_type>/<embedding_mode>
```
where:
* `batch_<batch_number>` is either `batch_1` or `batch_2`
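* `<dataset_type>` is one of `train`, `val` or `test` (the examples in this notebook use directories named `training`, `validation` and `test`; use whichever name exists on your disk)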
* `<embedding_mode>` is one of `single_gene_barcode`, `gene_concat` or `barcode_channel`

Please note that the `<embedding_mode>` will be added automatically.
You don't need to add it to the dataset path, just change the `embedding_mode` parameter for the `CnvDataset` class.
Also, please make sure that the directory actually exists.
In any case, the Python code will raise an exception if it does not find any embedding files.
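The path pattern above can be sketched with `pathlib`. The helper `embedding_dir` is hypothetical and not part of `src/data/dataset.py`; it only illustrates how the parts fit together and which part is passed to `CnvDataset` as `root`.

```python
from pathlib import Path


def embedding_dir(data_root, batch_number, dataset_type, embedding_mode):
    """Hypothetical helper: build the full embedding directory path."""
    return (Path(data_root) / 'embeddings' / f'batch_{batch_number}'
            / dataset_type / embedding_mode)


full_path = embedding_dir('../data', 1, 'val', 'single_gene_barcode')
print(full_path)

# the dataset root is the parent directory, i.e. without the embedding mode
dataset_root_example = full_path.parent
print(dataset_root_example)
```

The dataset type is passed as a plain string here, so any directory name that exists on disk can be used.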

OK. Now let's define the dataset we want to use.
In this example I chose the validation set of batch 1.

In [47]:
dataset_root = data_root / 'embeddings' / 'batch_1' / 'validation'
dataset_root

PosixPath('../data/embeddings/batch_1/validation')

Next, we read the validation split data frame using pandas.

In [48]:
b1_val_path = data_root / 'splits' / 'batch1_val_filtered.tsv'
b1_val_df = pd.read_csv(b1_val_path, sep='\t')
b1_val_df

Unnamed: 0,barcode,gene_id,expression_count,classification
0,AAAGGTTAGGGTGGAT-1,ENSG00000173372,0.407756,low
1,AAAGGTTAGGGTGGAT-1,ENSG00000226476,1.103188,high
2,AAAGGTTAGGGTGGAT-1,ENSG00000231252,1.257665,high
3,AAAGGTTAGGGTGGAT-1,ENSG00000229956,0.696581,low
4,AAAGGTTAGGGTGGAT-1,ENSG00000188641,0.407756,low
...,...,...,...,...
8949,TTGGCTACATAAGTTC-1,ENSG00000198938,2.065108,high
8950,TTGGCTACATAAGTTC-1,ENSG00000198840,1.721116,high
8951,TTGGCTACATAAGTTC-1,ENSG00000198886,2.611877,high
8952,TTGGCTACATAAGTTC-1,ENSG00000198786,1.907831,high
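The `Unnamed: 0` column in the output above is the row index that was written to the TSV file. A minimal sketch on made-up toy data shows how `index_col=0` reads it as the index instead, and how to get a quick overview of the class balance. I have not checked that `CnvDataset` behaves identically with a data frame read this way.

```python
import io

import pandas as pd

# toy TSV with the same columns as the split files (values copied from above)
toy_tsv = (
    "\tbarcode\tgene_id\texpression_count\tclassification\n"
    "0\tAAAGGTTAGGGTGGAT-1\tENSG00000173372\t0.407756\tlow\n"
    "1\tAAAGGTTAGGGTGGAT-1\tENSG00000226476\t1.103188\thigh\n"
    "2\tAAAGGTTAGGGTGGAT-1\tENSG00000231252\t1.257665\thigh\n"
)

# index_col=0 uses the first (unnamed) column as the row index
toy_df = pd.read_csv(io.StringIO(toy_tsv), sep='\t', index_col=0)
print(toy_df.columns.tolist())

# number of data points per class
class_counts = toy_df['classification'].value_counts()
print(class_counts)
```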


In [49]:
b1_val_dataset = CnvDataset(
    root=dataset_root,
    data_df=b1_val_df
)

Using 51 barcodes
Using 1093 genes
No embedding files for 988 data points in ../data/embeddings/batch_1/validation/single_gene_barcode!


Your output should look something like:
```
Using 51 barcodes
Using 1093 genes
No embedding files for 932 data points in ../data/embeddings/batch_1/val/single_gene_barcode!
```

This means that, of the data points with target values in the data frame, 932 have no embedding file.
The exact number depends on which embedding files exist in your directory (the run above reports 988).
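To illustrate what such a message counts, here is a minimal sketch that checks for each row of a toy data frame whether its embedding file exists. The file naming and the toy data are made up; the real naming scheme used by `CnvDataset` may differ.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    toy_df = pd.DataFrame({
        'barcode': ['cell_a', 'cell_a', 'cell_b'],
        'gene_id': ['gene_1', 'gene_2', 'gene_1'],
    })
    # hypothetical file naming: <barcode>_<gene_id>.pt
    paths = [root / f'{b}_{g}.pt'
             for b, g in zip(toy_df['barcode'], toy_df['gene_id'])]
    # pretend that only two of the three embeddings were computed
    paths[0].touch()
    paths[2].touch()
    toy_df['embedding_path'] = paths

    # True for every row whose embedding file exists
    has_file = toy_df['embedding_path'].map(lambda p: p.exists())
    n_missing = int((~has_file).sum())
    usable_df = toy_df[has_file]

print(f'No embedding files for {n_missing} data points')
print(f'{len(usable_df)} usable data points')
```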

Now you should be able to get the number of data points and the first rows of the data frame by using the string representation of the dataset variable.

In [50]:
print(str(b1_val_dataset))

<class 'src.data.dataset.CnvDataset'> with 7966 datapoints
              barcode          gene_id  expression_count classification  \
0  AAAGGTTAGGGTGGAT-1  ENSG00000173372          0.407756            low   
1  AAAGGTTAGGGTGGAT-1  ENSG00000226476          1.103188           high   
2  AAAGGTTAGGGTGGAT-1  ENSG00000231252          1.257665           high   
3  AAAGGTTAGGGTGGAT-1  ENSG00000229956          0.696581            low   
4  AAAGGTTAGGGTGGAT-1  ENSG00000188641          0.407756            low   

                                      embedding_path  
0  ../data/embeddings/batch_1/validation/single_g...  
1  ../data/embeddings/batch_1/validation/single_g...  
2  ../data/embeddings/batch_1/validation/single_g...  
3  ../data/embeddings/batch_1/validation/single_g...  
4  ../data/embeddings/batch_1/validation/single_g...  


We should also be able to get the embedding and the classification label from the dataset using an index (just like a list).

In [8]:
b1_val_dataset[0]

(tensor([[0., 0., 1.,  ..., 1., 0., 1.],
         [0., 1., 0.,  ..., 0., 0., 0.],
         [1., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 tensor([0.]))

In [9]:
embedding, target = b1_val_dataset[0]
print(type(embedding))
print(type(target))

<class 'torch.Tensor'>
<class 'torch.Tensor'>


To use the dataset for regression you need to set the `target_type` parameter for the `CnvDataset` class to `'regression'`.

In [10]:
b1_val_dataset = CnvDataset(
    root=dataset_root,
    data_df=b1_val_df,
    target_type='regression'
)

Using 51 barcodes
Using 1093 genes
No embedding files for 932 data points in ../data/embeddings/batch_1/val/single_gene_barcode!


In [51]:
b1_val_dataset[0]

(tensor([[0., 0., 0.,  ..., 1., 1., 1.],
         [0., 1., 0.,  ..., 1., 1., 1.],
         [0., 0., 0.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 tensor([0.]))

Now let's load the training set for a regression use case.

In [22]:
b1_train_path = data_root / 'splits' / 'batch1_training_filtered.tsv'
train_set_root = data_root / 'embeddings' / 'batch_1' / 'training'
b1_train_df = pd.read_csv(b1_train_path, sep='\t')
b1_train_dataset = CnvDataset(
    root=train_set_root,
    data_df=b1_train_df,
    target_type='regression'
)

Using 356 barcodes
Using 1595 genes
No embedding files for 5153 data points in ../data/embeddings/batch_1/training/single_gene_barcode!


In [23]:
b1_train_dataset

<class 'src.data.dataset.CnvDataset'> with 54188 datapoints

In [35]:
b1_train_dataset[42]

(tensor([[1., 0., 0.,  ..., 0., 0., 1.],
         [0., 1., 0.,  ..., 0., 1., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 tensor([0.7499]))
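To make the difference between the two target types concrete, here is a minimal sketch of how a row of the split data frame could be turned into a target. The function `make_target` is hypothetical, and the mapping of `low` to 0 and `high` to 1 is an assumption inferred from the outputs above, not taken from the `CnvDataset` source.

```python
def make_target(row, target_type='classification'):
    """Hypothetical helper: derive a target value from one data frame row."""
    if target_type == 'classification':
        # assumed label encoding: 'high' -> 1.0, everything else -> 0.0
        return [1.0 if row['classification'] == 'high' else 0.0]
    if target_type == 'regression':
        # regression uses the (normalized) expression count directly
        return [float(row['expression_count'])]
    raise ValueError(f'unknown target_type: {target_type}')


# first row of the validation split shown earlier
row = {'expression_count': 0.407756, 'classification': 'low'}
class_target = make_target(row)
regression_target = make_target(row, target_type='regression')
print(class_target)
print(regression_target)
```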

And here is an example using the test dataset, again with `target_type='regression'`.

In [14]:
test_set_root = data_root / 'embeddings' / 'batch_1' / 'test'
b1_test_df = pd.read_csv(b1_test_path, sep='\t')
b1_test_dataset = CnvDataset(
    root=test_set_root,
    data_df=b1_test_df,
    target_type='regression'
)

Using 102 barcodes
Using 1235 genes
No embedding files for 3795 data points in ../data/embeddings/batch_1/test/single_gene_barcode!


In [15]:
b1_test_dataset

<class 'src.data.dataset.CnvDataset'> with 14840 datapoints

## Iterating over the Dataset
There are three common ways to iterate over a Dataset.
1. use a for-loop over the dataset
2. use a for-loop over `range(len(dataset))` and get the data point per index.
3. use a `DataLoader` from `torch.utils.data`.

Let's start by using the same dataset as in the beginning of this tutorial.

In [53]:
# this time I am specifically requesting numpy results from the dataset
b1_val_dataset = CnvDataset(
    root=data_root / 'embeddings' / 'batch_1' / 'validation',
    data_df=pd.read_csv(data_root / 'splits' / 'batch1_val_filtered.tsv', sep='\t'),
    return_numpy=True
)

Using 51 barcodes
Using 1093 genes
No embedding files for 988 data points in ../data/embeddings/batch_1/validation/single_gene_barcode!


Now, let's start by making a for loop over the Dataset.

In [54]:
# this code prints the row sum for the first 8 embeddings with either ATAC,
#  CNV loss or CNV gain
i = 0
for embedding, target in b1_val_dataset:
    t_sum = embedding.sum(axis=1)
    if any(t_sum[4:] > 0):
        i += 1
        print(t_sum)
    if i > 7:
        break

[ 2990  2003  1999  3008     0 10000     0]
[ 7168  7450  7334  7113 10000     0     0]
[4260 4412 4196 4284 1822    0    0]
[ 2312  2492  2553  2643 10000     0     0]
[ 2789  2188  2130  2893     0     0 10000]
[ 2976  1906  1958  3160     0     0 10000]
[2998 3817 4065 3152 2488    0    0]
[2641 2308 2475 2576 1748    0    0]


Next, let's use index-based access with a range in the for loop.
You can also use this to iterate over a specific set of indices in the dataset.

In [43]:
start = 42
for i in range(start, len(b1_val_dataset)):
    embedding, target = b1_val_dataset[i]
    print(embedding.shape)
    print(target)
    if i > 7 + start:
        break

(7, 10000)
[1.]
(7, 10000)
[0.]
(7, 10000)
[0.]
(7, 10000)
[0.]
(7, 10000)
[1.]
(7, 10000)
[0.]
(7, 10000)
[0.]
(7, 10000)
[0.]
(7, 10000)
[1.]


Lastly, you can use a `torch.utils.data.DataLoader` to iterate through the dataset.
This is possible because the `CnvDataset` class inherits from `torch.utils.data.Dataset`.
See an example usage of a DataLoader below.

In [18]:
from torch.utils.data import DataLoader

In [19]:
# the batch size determines the group size per iteration
b1_val_loader = DataLoader(b1_val_dataset, batch_size=3)

In [20]:
for i, batch in enumerate(b1_val_loader):
    embeddings, targets = batch
    print(embeddings.shape)
    print(targets.shape)
    if i > 7:
        break

torch.Size([3, 7, 10000])
torch.Size([3, 1])
torch.Size([3, 7, 10000])
torch.Size([3, 1])
torch.Size([3, 7, 10000])
torch.Size([3, 1])
torch.Size([3, 7, 10000])
torch.Size([3, 1])
torch.Size([3, 7, 10000])
torch.Size([3, 1])
torch.Size([3, 7, 10000])
torch.Size([3, 1])
torch.Size([3, 7, 10000])
torch.Size([3, 1])
torch.Size([3, 7, 10000])
torch.Size([3, 1])
torch.Size([3, 7, 10000])
torch.Size([3, 1])


As you can see, even though `CnvDataset` returns numpy arrays, the DataLoader converts these into PyTorch tensors.
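This conversion is done by the default collate function of the `DataLoader`, which stacks the samples of a batch. A minimal, self-contained sketch with a toy dataset shows the effect; `ToyNumpyDataset` is a hypothetical stand-in and not the real `CnvDataset`.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ToyNumpyDataset(Dataset):
    """Hypothetical stand-in that returns numpy arrays for every item."""

    def __len__(self):
        return 6

    def __getitem__(self, idx):
        embedding = np.zeros((7, 10), dtype=np.float32)
        target = np.array([idx % 2], dtype=np.float32)
        return embedding, target


loader = DataLoader(ToyNumpyDataset(), batch_size=3)

# the default collate function stacks the numpy arrays into tensors
embeddings, targets = next(iter(loader))
print(type(embeddings), embeddings.shape)
print(type(targets), targets.shape)
```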

## Computing Embeddings from scratch
This section covers (re-)computing embeddings using the `CnvDataset`.
For this we need a little more information than before.
As before, let's start by defining some paths to relevant files and directories.

In [6]:
# directories we will need
out_root = git_root / 'out'

# files we will need
genome_fasta = data_root / 'reference' / 'GRCh38.d1.vd1.fa'
assert genome_fasta.exists()
gtf_path = data_root / 'gene_positions_and_overlaps' / 'gene_positions.csv'
assert gtf_path.exists()
overlap_path = data_root / 'gene_positions_and_overlaps' / 'overlaps_batch1.tsv'
assert overlap_path.exists()
epiAneufinder_path = out_root / 'epiAneufinder' / 'epiAneuFinder_results.tsv'
assert epiAneufinder_path.exists()
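The individual `assert` statements stop at the first missing file. If you prefer to see all missing inputs at once, a sketch like the following could be used. The helper `find_missing` and the toy file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def find_missing(paths):
    """Hypothetical helper: return the names of all paths that do not exist."""
    return [name for name, path in paths.items() if not Path(path).exists()]


with tempfile.TemporaryDirectory() as tmp:
    # toy setup: one file exists, the other one does not
    existing = Path(tmp) / 'genome.fa'
    existing.touch()
    required = {
        'fasta_path': existing,
        'cnv_path': Path(tmp) / 'not_there.tsv',
    }
    missing = find_missing(required)

print(missing)
```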

In [11]:
b2_train_path = data_root / 'splits' / 'batch2_training_filtered.tsv'
b2_val_path = data_root / 'splits' / 'batch2_val_filtered.tsv'
b2_test_path = data_root / 'splits' / 'batch2_test_filtered.tsv'

In [16]:
b2_df = pd.read_csv(b2_val_path, sep='\t')

In [17]:
# compute all embeddings for the validation split of batch 2
b1_dataset = CnvDataset(
    root=data_root / 'embeddings' / 'batch_2' / 'val_pt' ,
    data_df=b2_df,
    fasta_path=genome_fasta,
    gtf_path=gtf_path,
    atac_path=overlap_path,
    cnv_path=epiAneufinder_path,
    force_recompute=True,
    file_format='pt',
    verbose=3
)

Using 35 barcodes
GCTAGCTCATCCCGCT-2,AAACCAACATTGCGGT-2,ACTAACTCATAAGTTC-2,CTGTACCTCATTGCGG-2,TACGGATTCGGTCAGC-2,CCTTCAATCTCACACC-2,TTAGGCCCAGACAAAC-2,TATATCCTCGTTACTT-2,TTTAACCTCCTGGTGA-2,ACTTTGTTCTCAATTC-2,TATGACATCAAACACC-2,TTTGTGGCATGAATAG-2,CGGAGTCTCCTCATCA-2,GGACAGCCATCCATCT-2,CAGCCTTTCCCTGGAA-2,GACGCAACATTAGGTT-2,TACGCTTGTTAAATGC-2,TCAAGAACAGCAAGAT-2,GGTCAGGAGTTCCCGT-2,TCTTCAAGTGACCTGG-2,AATTGCTCAGGAATCG-2,ACGCCTAAGCCTGGTA-2,GACCTGCAGATAACCC-2,CTGAAACTCATTTGCT-2,GGAACAATCTTGTCCA-2,TCCATTGTCCCTCACG-2,AGGCTAGCAAACCCTA-2,TACTTCGTCCCATAAA-2,GCGTTTCTCTGCAACG-2,CACTTTGTCTAATCAG-2,GCACGAACAACCTAAT-2,GCTAATATCAAGCCTG-2,GTCCTCCCATAGCTGC-2,TCAACAATCTTAGGAC-2,AGGCGGATCATGAAGG-2
Using 621 genes
ENSG00000153208,ENSG00000253868,ENSG00000175175,ENSG00000184828,ENSG00000154655,ENSG00000228142,ENSG00000176771,ENSG00000167978,ENSG00000004846,ENSG00000135821,ENSG00000162946,ENSG00000130653,ENSG00000166926,ENSG00000127946,ENSG00000140284,ENSG00000127507,ENSG00000184226,ENSG00000107611,ENSG000001650

KeyboardInterrupt: 

Just like with a precomputed dataset, you can now access any data point by index.

In [20]:
b1_dataset[0]

{'embedding': tensor([[0, 1, 0,  ..., 0, 1, 0],
         [0, 0, 1,  ..., 0, 0, 0],
         [1, 0, 0,  ..., 1, 0, 1],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]], dtype=torch.uint8),
 'target': 'low'}

You'll also get a brief overview of the dataset when printing.

In [21]:
print(b1_dataset)

<class 'src.data.dataset.CnvDataset'> with 8022 datapoints
                barcode          gene_id  expression_count classification  \
27   GACGTAAAGCATGTTA-1  ENSG00000069424          0.606885            low   
69   CCTTAACGTCGTAAAT-1  ENSG00000215788          0.638693            low   
82   GCTATCCTCCCTCGCA-1  ENSG00000215788          0.216336            low   
87   TACGGTTAGCACAGCC-1  ENSG00000215788          0.569907            low   
131  GATTTGCAGCCTGTTC-1  ENSG00000171621          0.346812            low   

                                        embedding_path  
27   ../data/embeddings/batch_1/val_redo2/single_ge...  
69   ../data/embeddings/batch_1/val_redo2/single_ge...  
82   ../data/embeddings/batch_1/val_redo2/single_ge...  
87   ../data/embeddings/batch_1/val_redo2/single_ge...  
131  ../data/embeddings/batch_1/val_redo2/single_ge...  
