# Bellastore Demo

The purpose of this demo is to showcase the main functionality of this package.

The main purpose of `bellastore` is to organize whole slide image scans (WSIs, one of the main data sources in digital pathology) both on the file system and at the database level.\
To this end, `bellastore` creates and manages a database with ingress and storage tables, moves valid WSI files from ingress to storage within the file system, and records each move in the database.

The result is a comprehensive storage database that can be queried efficiently to retrieve files from a storage containing possibly tens of thousands of WSIs.

The ingress table, on the other hand, records the origin of the files.\
Clinical labs often encode valuable information in folder and file names, which are recorded directly in the ingress table.\
Thus, even after files are moved and renamed at the storage level, all the original source metadata remains tracked.
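
To make the idea of source metadata concrete, here is a minimal sketch that splits an original path into the parts a lab might have encoded information in. The function `source_metadata`, the example path, and the dictionary keys are purely hypothetical and are not the actual columns of the `bellastore` ingress table.

```python
from pathlib import PurePosixPath


def source_metadata(original_path: str) -> dict:
    """Split an original scan path into folder and file name parts.

    Hypothetical helper for illustration only, not part of bellastore.
    """
    p = PurePosixPath(original_path)
    return {
        "folder": p.parent.name,  # name of the directly enclosing folder
        "filename": p.name,       # file name including the extension
        "stem": p.stem,           # file name without the extension
        "suffix": p.suffix,       # extension, e.g. ".ndpi"
    }


# Made-up path, as a clinical lab might deliver it
meta = source_metadata("/clinic/cohort_2021_biopsies/patient17_HE.ndpi")
print(meta)
```

Keeping these parts next to the hash means the original naming can still be looked up after the file has been renamed in storage.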

Furthermore, the ingress acts as a gate: only slides that are not already tracked proceed to the storage.\
A possible duplicate is still recorded in the ingress (as it could hold valuable metadata), but the WSI itself will not proceed to the storage.\
WSI identity is checked by hashing the scan file. For large scans this is the most time-consuming step in the pipeline.
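
The hashing step can be sketched as follows. This is a minimal illustration, assuming SHA-256 and chunked reading; the demo does not state which hash function `bellastore` actually uses, and `hash_file` is a hypothetical helper.

```python
import hashlib
from pathlib import Path
from tempfile import TemporaryDirectory


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks.

    Reading in chunks keeps memory usage bounded for multi-gigabyte scans.
    SHA-256 is an assumption made for this sketch.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.ndpi"
    b = Path(tmp) / "copy_of_a.ndpi"
    c = Path(tmp) / "c.ndpi"
    a.write_bytes(b"same content")
    b.write_bytes(b"same content")
    c.write_bytes(b"other content")
    # Identical content gives identical hashes, regardless of the file name
    same = hash_file(a) == hash_file(b)
    different = hash_file(a) != hash_file(c)

print(same, different)
```

Because every byte of the file has to be read once, the cost grows linearly with the scan size, which is why this step dominates the runtime for large scans.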

The beauty of `bellastore` is that the storage database can be extended, in the spirit of relational databases, with all sorts of metadata such as patient identifiers, clinical grades, and cohort identifiers.
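
As a rough sketch of such an extension, the following uses plain `sqlite3` with an in-memory database. The table names `storage` and `clinical`, their columns, and all inserted values are hypothetical and only illustrate how metadata could be joined to stored scans via the hash; the real `bellastore` schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified stand-in for the storage table
    CREATE TABLE storage (hash TEXT PRIMARY KEY, filepath TEXT);
    -- Hypothetical metadata table linked to storage via the scan hash
    CREATE TABLE clinical (
        hash TEXT REFERENCES storage(hash),
        patient_id TEXT,
        grade INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO storage VALUES (?, ?)",
    [("h0", "storage/h0.ndpi"), ("h1", "storage/h1.ndpi"), ("h2", "storage/h2.ndpi")],
)
conn.executemany(
    "INSERT INTO clinical VALUES (?, ?, ?)",
    [("h0", "P-001", 2), ("h1", "P-002", 3), ("h2", "P-001", 3)],
)

# Retrieve the stored files of all scans with clinical grade 3
rows = conn.execute(
    """
    SELECT s.filepath
    FROM storage AS s
    JOIN clinical AS c ON c.hash = s.hash
    WHERE c.grade = ?
    ORDER BY s.filepath
    """,
    (3,),
).fetchall()
grade3_files = [row[0] for row in rows]
conn.close()

print(grade3_files)
```

Using the hash as the join key keeps the metadata valid even if files are renamed or moved within the storage.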

The philosophy of the `bellastore` backend is that it keeps track of the file system via the `Fs` class and keeps it in sync with the database via the `Db` class, which inherits from `Fs`.

This demo showcases the workflow of `bellastore` by encapsulating the main integration test `tests/test_db_fs.py::test_classic`. 

## Setting up a mock ingress and storage

First of all we need to mock the state of the file system and the databases.

In the main use case of `bellastore` we are in the following scenario:
- there are scans already recorded in both the ingress and storage tables of the database
- the respective scans are present in the storage file system

In this demo we start from scratch and then in a second step mock a new cohort arriving from the clinic. 

In [None]:
import os
from pathlib import Path
from typing import List
from tempfile import TemporaryDirectory, TemporaryFile
import shutil 

from bellastore.utils.scan import Scan
from bellastore.database.db import Db

In [None]:
def create_scans(path: Path, amount=4) -> List[Scan]:
    '''
    Mocks scans on a specified path.

    The mock scans are just text files with content that is unique for each scan.
    However, they carry the file extension .ndpi.

    Parameters
    -----------
    path : Path
        The shared directory holding the mocked scans
    amount : int
        The amount of scans to be created
    
    Returns
    --------
    List[Scan]
        The list of created scans
    '''
    scans = []
    for i in range(amount):
        p = path / f"scan_{i}.ndpi"
        p.write_text(f"Content of scan_{i}.ndpi", encoding="utf-8")
        scan = Scan(str(p))
        scans.append(scan)
    return scans

In [None]:
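# the root of the fs holding both storage and ingress
# (mkdtemp keeps the directory alive until it is removed explicitly at the end of the demo;
# TemporaryDirectory().name would rely on the directory being deleted and recreated)
from tempfile import mkdtemp
root_dir = mkdtemp()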

# create four scans in ingress
ingress_dir = Path(root_dir) / "new_scans"
os.makedirs(ingress_dir)
ingress_scans = create_scans(path=ingress_dir, amount=4)

In [None]:
# initialize the database holding ingress and storage table
db = Db(root_dir=root_dir, ingress_dir=ingress_dir, filename='scans.sqlite')
# The first part of the output shows the fs tree, and the second part shows the Ingress and Storage tables.
# (Jupyter might print the tables as a single line)
print(str(db))

The `Db` object now holds all information about both the database and the file system:
- there are 4 scans in the ingress (`new_scans`)
- the storage contains only the database file `scans.sqlite`
- the database holds two empty tables `Ingress` and `Storage`

## Insert into storage

Now it is time to insert the scans from the ingress into the storage.

Note that this is a delicate process, requiring the following actions 💡:
- **check** whether the file in the ingress is a **valid** scan
    - if not, it stays in the ingress as it is -> BREAK
- **hash** the scan in order to make it comparable to existing scans
- **compare** the hash to the hashes already recorded in the **ingress table**
    - if there is an entry with identical hash, path and scan name, the file is removed from the fs -> BREAK
- **compare** the hash to the hashes already recorded in the **storage table**
    - if the hash is already in the storage table, record the scan only in the ingress table and then remove the file from the fs -> BREAK
- **add** the scan to the **storage database**
    - move the file into the storage directory
    - record the scan in the storage table
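
The decision logic above can be sketched in plain Python. This is not the `bellastore` implementation: `MockScan`, `decide`, and the returned labels are hypothetical, the hash is assumed to be precomputed, and the two tables are modeled as in-memory sets. The sketch also assumes that a scan moved to storage is recorded in the ingress table as well, which matches the table contents shown later in this demo.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class MockScan:
    """Hypothetical stand-in for a scan; not the bellastore Scan class."""
    hash: str
    path: str
    name: str
    valid: bool = True


def decide(scan: MockScan,
           ingress_rows: Set[Tuple[str, str, str]],
           storage_hashes: Set[str]) -> str:
    """Return what happens to a scan and update the mock tables accordingly."""
    if not scan.valid:
        return "stay_in_ingress"  # invalid files are left untouched
    key = (scan.hash, scan.path, scan.name)
    if key in ingress_rows:
        return "remove_file"  # exact same source was already recorded
    ingress_rows.add(key)  # new source metadata is always recorded
    if scan.hash in storage_hashes:
        return "record_in_ingress_and_remove_file"  # duplicate content
    storage_hashes.add(scan.hash)
    return "move_to_storage"


ingress_rows: Set[Tuple[str, str, str]] = set()
storage_hashes: Set[str] = set()
first = MockScan("h0", "new_scans", "scan_0.ndpi")
outcomes = [
    decide(first, ingress_rows, storage_hashes),
    decide(first, ingress_rows, storage_hashes),
    decide(MockScan("h0", "new_cohort", "scan_0.ndpi"), ingress_rows, storage_hashes),
    decide(MockScan("", "new_cohort", "notes.txt", valid=False), ingress_rows, storage_hashes),
]
print(outcomes)
```

The four calls cover the four branches in order: a new scan, an exact repeat, the same content arriving from a different folder, and an invalid file.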

In [None]:
valid_scans = db.insert_from_ingress()

The log messages show each state a slide passes through, following the logic described above.

Now we can check whether the file system and the database are actually in the state we expect them to be.

In [None]:
# storage contains four hashed slides
# ingress and storage table also hold exactly these slides
print(str(db))

So we have now properly initialized our database, and the storage on the file system holds four scans 🙌

## Adding new and existing scans to the storage

In an application scenario, we will now receive a new cohort of scans. This batch might hold scans that are not yet present in the storage, as well as duplicates of scans we already have.

In the following example the *new* cohort is just the old cohort extended by two new scans.

In [None]:
ingress_dir = Path(root_dir) / "new_cohort"
os.makedirs(ingress_dir)
ingress_scans = create_scans(path=ingress_dir, amount=6)

When we now insert from ingress, we expect **all** scans of the new cohort to be recorded in the ingress table, because the folder name `new_cohort` differs from the previous folder name `new_scans`, and this might be valuable metadata that we definitely do not want to lose.

However, we only expect `scan_4.ndpi` and `scan_5.ndpi` to be inserted into storage.

In [None]:
valid_scans = db.insert_from_ingress()

Whoops, what happened? 🤔

Well, the file system and the database do not know that the ingress directory is no longer `new_scans` but `new_cohort`. So we first need to mount the new ingress directory.

Note that this of course does not reinitialize the database, as all scans are stored in `scans.sqlite`.
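
Why reopening does not wipe existing records can be illustrated with plain `sqlite3`. This is only a sketch under the assumption that tables are created with `CREATE TABLE IF NOT EXISTS`; how `bellastore` initializes its tables is not shown in this demo, and `open_db` and the one-column `storage` table are hypothetical.

```python
import sqlite3
from pathlib import Path
from tempfile import TemporaryDirectory


def open_db(db_file: Path) -> sqlite3.Connection:
    """Open the database file, creating the table only if it is missing."""
    conn = sqlite3.connect(db_file)
    conn.execute("CREATE TABLE IF NOT EXISTS storage (hash TEXT PRIMARY KEY)")
    return conn


with TemporaryDirectory() as tmp:
    db_file = Path(tmp) / "scans.sqlite"

    # First session: create the table and record one scan
    conn = open_db(db_file)
    conn.execute("INSERT INTO storage VALUES ('h0')")
    conn.commit()
    conn.close()

    # Second session: reopening the same file keeps the existing rows
    conn = open_db(db_file)
    rows_after_reopen = conn.execute("SELECT hash FROM storage").fetchall()
    conn.close()

print(rows_after_reopen)
```

The records live in the database file on disk, so a new connection to the same file sees everything that was committed before.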

In [None]:
db = Db(root_dir=root_dir, ingress_dir=ingress_dir, filename='scans.sqlite')
print(str(db))

In [None]:
valid_scans = db.insert_from_ingress()

In [None]:
print(str(db))

The log shows that everything worked out as expected. 🔥

In [None]:
shutil.rmtree(root_dir)