# Dataset Tutorial
This notebook is a short tutorial on how to use the PyTorch dataset I implemented.
You can find the code in `src/data/dataset.py`, in the `CnvDataset` class.

Let's start by importing some packages we might need.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

# add this to your notebook so that code you change in an imported
# Python file is reloaded automatically
%load_ext autoreload
%autoreload 2

Now we also need to import the `CnvDataset` class.
Since the path to this notebook (relative to the git repository root) is `preprocessing/dataset_example.ipynb`, we need to add the parent directory to our system path.
This tells the notebook where to find the software folder `src`, so that we can import code from there.

In [31]:
import sys
sys.path.append('..') # add the parent directory to system path
from src.data.dataset import CnvDataset
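
Note that `sys.path.append('..')` only works while the notebook's working directory is `preprocessing/`. The following is a minimal sketch of a more defensive variant that walks upwards from a starting directory until it finds a folder containing `src`. The helper names `find_repo_root` and `add_to_sys_path` and the `marker` parameter are made up for this example and are not part of the repository.

```python
import sys
from pathlib import Path


def find_repo_root(start: Path, marker: str = "src") -> Path:
    """Return `start` or its closest ancestor that contains a `marker` directory."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no directory containing '{marker}' found above {start}")


def add_to_sys_path(directory: Path) -> None:
    """Prepend `directory` to sys.path unless it is already listed."""
    if str(directory) not in sys.path:
        sys.path.insert(0, str(directory))
```

With these helpers, `add_to_sys_path(find_repo_root(Path.cwd()))` would replace the `sys.path.append('..')` line, and the import of `CnvDataset` stays the same.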

Great :)
Now that this is out of the way, let's define some important paths for the files we need to work with.

In [3]:
# directories we will need
git_root = Path('..')
data_root = git_root / 'data'
assert data_root.exists()

# dataset split files
b1_train_path = data_root / 'splits' / 'batch1_training_filtered.tsv'
b1_val_path = data_root / 'splits' / 'batch1_val_filtered.tsv'
b1_test_path = data_root / 'splits' / 'batch1_test_filtered.tsv'

Alright, now we are almost ready to use the `CnvDataset`.
One last thing that is missing is the path to the directory that stores the embedding files for the dataset we want to use.
All dataset paths follow the same pattern:
```
data/embeddings/batch_<batch_number>/<dataset_type>/<embedding_mode>
```
where:
* `batch_<batch_number>` is either `batch_1` or `batch_2`
* `<dataset_type>` is one of `train`, `val` or `test`
* `<embedding_mode>` is one of `single_gene_barcode`, `gene_concat` or `barcode_channel`

Please note that the `<embedding_mode>` is added automatically.
You don't need to add it to the dataset path; just change the `embedding_mode` parameter of the `CnvDataset` class.
Also, please make sure that the directory actually exists.
The Python code will raise an exception if it does not find any embedding files.
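
To make the path pattern concrete, here is a small sketch that assembles such a path and rejects values outside the lists above. The function `embedding_dir` and the constants are hypothetical helpers for illustration only; `CnvDataset` does not necessarily build its paths this way.

```python
from pathlib import Path

# allowed values, taken from the list above
BATCH_NUMBERS = (1, 2)
DATASET_TYPES = ("train", "val", "test")
EMBEDDING_MODES = ("single_gene_barcode", "gene_concat", "barcode_channel")


def embedding_dir(data_root: Path, batch_number: int, dataset_type: str,
                  embedding_mode: str) -> Path:
    """Build data_root/embeddings/batch_<n>/<dataset_type>/<embedding_mode>."""
    if batch_number not in BATCH_NUMBERS:
        raise ValueError(f"unknown batch number: {batch_number}")
    if dataset_type not in DATASET_TYPES:
        raise ValueError(f"unknown dataset type: {dataset_type}")
    if embedding_mode not in EMBEDDING_MODES:
        raise ValueError(f"unknown embedding mode: {embedding_mode}")
    return (Path(data_root) / "embeddings" / f"batch_{batch_number}"
            / dataset_type / embedding_mode)
```

Because the embedding mode is appended automatically, the directory you would pass to `CnvDataset` corresponds to `embedding_dir(...).parent`.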

OK. Now let's define the dataset we want to use.
In this example I chose the validation set of batch 1.

In [13]:
dataset_root = data_root / 'embeddings' / 'batch_1' / 'val'

Next, we read the validation split data frame using pandas.

In [14]:
b1_val_df = pd.read_csv(b1_val_path, sep='\t')

In [32]:
# load the precomputed embeddings for the batch 1 validation set
b1_val_dataset = CnvDataset(
    root=dataset_root,
    data_df=b1_val_df,
    embedding_mode='single_gene_barcode' # if you want a different embedding mode, change the parameter here
)

Using 51 barcodes
Using 1093 genes
No embedding files for 932 data points in ../data/embeddings/batch_1/val/single_gene_barcode!
Found 38082 unused embedding files in ../data/embeddings/batch_1/val/single_gene_barcode!
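
The two messages about missing and unused embedding files can be reproduced by hand. The sketch below assumes the file layout `<embedding_mode>/<barcode>/<gene_id>.mtx` that is visible in the embedding paths printed later in this notebook; the function `compare_with_disk` is a hypothetical helper and not necessarily how `CnvDataset` computes these counts.

```python
from pathlib import Path
from typing import Iterable


def compare_with_disk(pairs: Iterable[tuple[str, str]], mode_root: Path):
    """Compare expected embedding files with the files found below `mode_root`.

    `pairs` holds (barcode, gene_id) tuples. Returns two sets of paths:
    expected files that are missing on disk, and files on disk that no pair refers to.
    """
    mode_root = Path(mode_root)
    expected = {mode_root / barcode / f"{gene_id}.mtx" for barcode, gene_id in pairs}
    on_disk = set(mode_root.glob("*/*.mtx"))
    return expected - on_disk, on_disk - expected
```

For a data frame like `b1_val_df`, the pairs could be produced with `zip(b1_val_df['barcode'], b1_val_df['gene_id'])`.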


Now you should be able to get the number of data points and the first rows of the data frame from the string representation of the dataset variable.

In [34]:
print(str(b1_val_dataset))

<class 'src.data.dataset.CnvDataset'> with 8022 datapoints
               barcode          gene_id  expression_count classification  \
15  AAAGGTTAGGGTGGAT-1  ENSG00000020577          0.407756            low   
17  AAAGGTTAGGGTGGAT-1  ENSG00000021645          2.146118           high   
26  AAAGGTTAGGGTGGAT-1  ENSG00000030582          0.407756            low   
27  AAAGGTTAGGGTGGAT-1  ENSG00000033327          0.696581            low   
32  AAAGGTTAGGGTGGAT-1  ENSG00000038427          1.103188           high   

                                       embedding_path  
15  ../data/embeddings/batch_1/val/single_gene_bar...  
17  ../data/embeddings/batch_1/val/single_gene_bar...  
26  ../data/embeddings/batch_1/val/single_gene_bar...  
27  ../data/embeddings/batch_1/val/single_gene_bar...  
32  ../data/embeddings/batch_1/val/single_gene_bar...  


We should also be able to get the embedding and the classification label from the dataset using an index (just like with a list).

In [35]:
b1_val_dataset[0]

{'embedding': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 1,  ..., 0, 1, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]], dtype=torch.uint8),
 'label': 'low'}
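
Since every item is a dictionary with an `'embedding'` and a `'label'` entry, anything that supports `len()` and integer indexing can be processed the same way. As a small sketch, the hypothetical helper `count_labels` below counts how often each label occurs; it is shown on a toy list of dictionaries and assumes that `CnvDataset` also supports `len()`.

```python
from collections import Counter


def count_labels(dataset) -> Counter:
    """Count the 'label' entries of an indexable dataset of dictionaries."""
    return Counter(dataset[i]["label"] for i in range(len(dataset)))


# toy stand-in for a dataset; the embeddings are placeholders
toy_dataset = [
    {"embedding": None, "label": "low"},
    {"embedding": None, "label": "high"},
    {"embedding": None, "label": "low"},
]
label_counts = count_labels(toy_dataset)
```

Note that loading every item of the real dataset reads every embedding file, so this can be slow for large splits.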

## Computing Embeddings from Scratch
This section covers (re-)computing embeddings using the `CnvDataset`.
For this we need a little more information than before.
As before, let's start by defining some paths to relevant files and directories.

In [9]:
# directories we will need
out_root = git_root / 'out'

# files we will need
genome_fasta = data_root / 'reference' / 'GRCh38.d1.vd1.fa'
assert genome_fasta.exists()
overlap_path = data_root / 'overlap_genes_peaks.tsv'
assert overlap_path.exists()
epiAneufinder_path = out_root / 'epiAneufinder' / 'epiAneuFinder_results.tsv'
assert epiAneufinder_path.exists()
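
The `assert` statements above stop at the first missing file. If you prefer to see all missing inputs at once, a sketch like the following may help. The function `missing_inputs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_inputs(paths: dict[str, Path]) -> list[str]:
    """Return a readable 'name: path' entry for every path that does not exist."""
    return [f"{name}: {path}" for name, path in paths.items()
            if not Path(path).exists()]
```

It could be called as `missing_inputs({'fasta': genome_fasta, 'atac': overlap_path, 'cnv': epiAneufinder_path})`; an empty list means that all files were found.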

In [7]:
b1_df = pd.read_csv(b1_val_path, sep='\t')

In [11]:
# (re-)compute all embeddings for the batch 1 validation set
b1_dataset = CnvDataset(
    root=data_root / 'embeddings' / 'batch_1' / 'val',
    data_df=b1_df,
    fasta_path=genome_fasta,
    atac_path=overlap_path,
    cnv_path=epiAneufinder_path,
    embedding_mode='single_gene_barcode',
    force_recompute=True
)

Using 51 barcodes
Using 1093 genes
Recomputing embeddings:  True
[embed]: Iterating over all possible barcode-gene combinations
[embed]: Computing Embeddings with mode: "single_gene_barcode"
[embed]: Using 51 barcodes
[embed]:Using 904 genes


[embed]: Computing embeddings:   0%|                                                          | 0/46104 [00:00<?, ?it/s]

saving embedding
 [[0. 1. 0. ... 0. 1. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [1. 0. 0. ... 1. 0. 1.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]] 
to path ../data/embeddings/batch_1/val/single_gene_barcode/AAAGGTTAGGGTGGAT-1/ENSG00000069424.mtx
[... further "saving embedding" output omitted ...]

[embed]: Computing embeddings:   0%|                                               | 1/46104 [00:01<17:14:59,  1.35s/it]
[embed]: Computing embeddings:   0%|                                                | 2/46104 [00:01<9:44:01,  1.32it/s]
[embed]: Computing embeddings:   0%|                                               | 2/46104 [00:01<12:34:44,  1.02it/s]




KeyboardInterrupt: 
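
The total of 46104 in the progress bar is consistent with the log lines above: iterating over all possible barcode-gene combinations of 51 barcodes and 904 genes gives 51 * 904 = 46104 pairs. The sketch below illustrates this with `itertools.product`; the barcode and gene names are placeholders, and the actual loop inside `CnvDataset` may be organised differently.

```python
from itertools import product

n_barcodes, n_genes = 51, 904  # numbers reported in the log above

# placeholder identifiers, only used for counting
barcodes = [f"barcode_{i}" for i in range(n_barcodes)]
genes = [f"gene_{j}" for j in range(n_genes)]

# one entry per (barcode, gene) combination
n_combinations = sum(1 for _ in product(barcodes, genes))
print(n_combinations)  # 46104
```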

In [None]:
b1_dataset

[autoreload of src.data.dataset failed: Traceback (most recent call last):
  File "/vol/storage/shared/miniforge3/envs/cmscb/lib/python3.12/site-packages/IPython/extensions/autoreload.py", line 276, in check
    superreload(m, reload, self.old_objects)
  File "/vol/storage/shared/miniforge3/envs/cmscb/lib/python3.12/site-packages/IPython/extensions/autoreload.py", line 475, in superreload
    module = reload(module)
             ^^^^^^^^^^^^^^
  File "/vol/storage/shared/miniforge3/envs/cmscb/lib/python3.12/importlib/__init__.py", line 131, in reload
    _bootstrap._exec(spec, module)
  File "<frozen importlib._bootstrap>", line 866, in _exec
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/vol/storage/shared/cmscb8/preprocessing/../src/data/dataset.py", line 19, in <module>
    class CnvDataset(torch.utils.data.Dataset):
  File "/vol/storage/shared/cmscb8/preprocessing/../s

<class 'src.data.dataset.CnvDataset'> with 56497 datapoints

In [None]:
368 * 1451

533968

In [None]:
# sanity check: equals the 56497 datapoints reported by `b1_dataset` above
49 * 1153

56497

In [None]:
b1_df

Unnamed: 0,barcode,gene_id,expression_count,classification,embedding_path
0,AAACATGCAGGATGGC-1,ENSG00000084070,0.371921,low,../data/embeddings/batch_1/val/single_gene_bar...
1,AAACATGCAGGATGGC-1,ENSG00000127124,0.642400,low,../data/embeddings/batch_1/val/single_gene_bar...
2,AAACATGCAGGATGGC-1,ENSG00000269113,1.179453,high,../data/embeddings/batch_1/val/single_gene_bar...
3,AAACATGCAGGATGGC-1,ENSG00000173406,0.371921,low,../data/embeddings/batch_1/val/single_gene_bar...
4,AAACATGCAGGATGGC-1,ENSG00000226476,0.371921,low,../data/embeddings/batch_1/val/single_gene_bar...
...,...,...,...,...,...
49311,TTTGTTGGTACCAGGT-1,ENSG00000198899,2.295555,high,../data/embeddings/batch_1/val/single_gene_bar...
49312,TTTGTTGGTACCAGGT-1,ENSG00000198938,1.884219,high,../data/embeddings/batch_1/val/single_gene_bar...
49313,TTTGTTGGTACCAGGT-1,ENSG00000198840,0.749642,low,../data/embeddings/batch_1/val/single_gene_bar...
49314,TTTGTTGGTACCAGGT-1,ENSG00000198886,1.698359,high,../data/embeddings/batch_1/val/single_gene_bar...


In [None]:
b1_dataset.data_df

Unnamed: 0,barcode,gene_id,expression_count,classification,embedding_path


In [None]:
list(b1_dataset.root_path.iterdir())[:10]

[PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/AGCCGCTAGAATGACG-1'),
 PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/CATTATCTCGCGACAC-1'),
 PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/CATGAGGCACGGTTTA-1'),
 PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/GCTAAGTTCGGGCCAT-1'),
 PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/AGTTGCGTCGATATTG-1'),
 PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/AATTACCCAGCAAGTG-1'),
 PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/GGACCGAAGCGATAAG-1'),
 PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/ACCTACCTCGGCTAGC-1'),
 PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/TGATGAACAAGGCCAA-1'),
 PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/TGAACAGAGCACTTGG-1')]

In [None]:
print(b1_dataset)

<class 'src.data.dataset.CnvDataset'> with 0 datapoints


In [None]:
b1_df.iloc[0]['embedding_path']

PosixPath('../data/embeddings/batch_1/val/single_gene_barcode/AAACATGCAGGATGGC-1/ENSG00000084070.mtx')
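
As the path above suggests, the barcode is the name of the parent directory and the gene ID is the file name without the `.mtx` suffix. Assuming this layout holds for all embedding files, both can be recovered from a path with a small sketch like this (`parse_embedding_path` is a hypothetical helper):

```python
from pathlib import Path


def parse_embedding_path(path: Path) -> tuple[str, str]:
    """Split an embedding path into (barcode, gene_id)."""
    path = Path(path)
    return path.parent.name, path.stem


example = Path("../data/embeddings/batch_1/val/single_gene_barcode/"
               "AAACATGCAGGATGGC-1/ENSG00000084070.mtx")
barcode, gene_id = parse_embedding_path(example)
```

For the example path this yields the barcode `AAACATGCAGGATGGC-1` and the gene ID `ENSG00000084070`.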