# Star-TREX run time estimation for large datasets

Looping through tiles and decoding the spots in each tile separately can take a long time for large datasets. This pipeline estimates the run time for a given tile size by running a single tile and extrapolating to the full image, so that the optimal tile size can be identified before committing to a full run.
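The extrapolation idea can be sketched in a few lines. This is only an illustration under the assumption that every tile takes roughly as long as the one that was timed; the helper name `estimate_total_runtime` is hypothetical and is not part of Star-TREX, whose `run` function performs its own estimate.

```python
import math

def estimate_total_runtime(seconds_per_tile, image_shape, tile_shape):
    """Extrapolate the full run time from the timing of a single tile.

    Hypothetical helper: assumes all tiles take about the same time.
    image_shape and tile_shape are (y, x) sizes in pixels.
    """
    n_tiles_y = math.ceil(image_shape[0] / tile_shape[0])
    n_tiles_x = math.ceil(image_shape[1] / tile_shape[1])
    n_tiles = n_tiles_y * n_tiles_x
    return n_tiles, n_tiles * seconds_per_tile

# Made-up timing: one 512 x 512 tile of a 2048 x 2048 image took 12.5 s
n_tiles, total_seconds = estimate_total_runtime(12.5, (2048, 2048), (512, 512))
print(n_tiles, total_seconds)
```

Smaller tiles are faster individually but there are more of them, which is why the total has to be compared across tile sizes.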

### Load required data

Load required packages.

In [None]:
# Load packages
from IPython import get_ipython
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import numpy as np
import sys
import os

sys.path.insert(1, os.path.abspath('..'))

ipython = get_ipython()
ipython.run_line_magic("gui", "qt5")
ipython.run_line_magic("matplotlib", "inline")

matplotlib.rcParams["figure.dpi"] = 150

Define your working directory and the path to the settings.yaml file.

In [None]:
work_dir = "/Users/leonievb/Library/CloudStorage/OneDrive-Personal/Postdoc/Data/02_4_Gene_Test2/OME-TIFF_MaxIP/"

settings_path = "/Users/leonievb/Library/CloudStorage/OneDrive-Personal/Postdoc/Data/02_4_Gene_Test2/4genepanel_dapi-488-568-657-750/OME-TIFF_MaxIP/settings.yaml"

Load the experiment

In [None]:
#Load experiment
from starfish import Experiment
exp = Experiment.from_json(os.path.join(work_dir, "spacetx", "primary", "experiment.json"))
print(exp)

### Calculate registration offset and save

To avoid calculating a registration offset each time, calculate the registration offset for the full image once and store it as a JSON file. Skip this step if you have already done so. Use the settings.yaml file (found in star-trex/settings.yaml; make sure to adapt the settings to your data) to define the settings that remain fixed throughout the estimation, and pass the path to that file in the function call. If you want to change a setting quickly, you can still pass it in the function call below; it will override the value in settings.yaml.

In [None]:
from importlib import reload
from src import starfish_wrapper
reload(starfish_wrapper)  # pick up local changes to the wrapper without restarting the kernel
from src.starfish_wrapper import run

# Change these numbers as needed (here: the full image size in pixels)
x_transform = 2048
y_transform = 2048

# Only register the images and write the offsets to a JSON file
save_transforms = os.path.join(work_dir, "transformation/transforms.json")
run(exp, x_step=x_transform, y_step=y_transform, settings_path=settings_path,
    test=False, transforms=None, save_transforms=save_transforms, just_register=True)
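The advice to skip this step when the offsets already exist can be made explicit with a small guard. This is a minimal sketch; `registration_needed` is a hypothetical helper, not a Star-TREX function, and it only checks that the file is present, not that its contents are valid.

```python
import os

def registration_needed(transforms_path):
    """Return True if no stored offsets file exists at transforms_path."""
    return not os.path.isfile(transforms_path)

# Example with a placeholder path; in the notebook this would be save_transforms
print(registration_needed("transformation/transforms.json"))
```

In the cell above, the `run(..., just_register=True)` call could then be wrapped in `if registration_needed(save_transforms):`.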

Now point to the file containing your calculated offsets.

In [None]:
transforms = os.path.join(work_dir, "transformation/transforms.json")
transforms

### Run first tile of different tile sizes

Let's estimate the run time of the pipeline for different tile sizes. Each tile size must divide the length of the corresponding image edge evenly, e.g. if the x dimension of your image is 2048 pixels, the tiles could have an x length of 1024, 512, 256 etc. Currently, the code cannot handle tile sizes that do not divide the edge length evenly.
The image does not have to be square, e.g. it can be 2048 x 2000 pixels. In that case, different edge lengths can be chosen for the tiles, e.g. 512 px for the x-edge and 500 px for the y-edge.
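To find candidate tile sizes for a given edge length, one can list the values that divide it evenly. The sketch below is illustrative only; `valid_tile_sizes` and its `min_size` parameter are hypothetical names and are not part of Star-TREX.

```python
def valid_tile_sizes(edge_length, min_size=64):
    """List tile sizes (largest first) that divide edge_length evenly.

    min_size is an arbitrary lower bound to skip impractically small tiles.
    """
    return [size for size in range(edge_length, min_size - 1, -1)
            if edge_length % size == 0]

print(valid_tile_sizes(2048, min_size=256))  # [2048, 1024, 512, 256]
print(valid_tile_sizes(2000, min_size=250))  # [2000, 1000, 500, 400, 250]
```

For a 2048 x 2000 image, pairing entries from the two lists (e.g. 512 with 500) gives tiles that cover the image without remainder.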

Define the tile sizes to be tested below. Make sure the lists `x_test` and `y_test` have the same number of elements.

In [3]:
x_test = [1024, 512, 256]
y_test = [1024, 512, 256]

if len(x_test) != len(y_test):
    raise Exception("The list x_test and y_test must have the same number of elements")

Now run the estimation. Be aware that, depending on the tile sizes, this might take a while.

In [None]:
from importlib import reload
from src import starfish_wrapper
reload(starfish_wrapper)
from src.starfish_wrapper import run

times = []
for x_step, y_step in zip(x_test, y_test):
    # Pass the tile size under test and reuse the stored registration offsets
    days, hours, minutes, seconds = run(exp, x_step=x_step, y_step=y_step,
                                        settings_path=settings_path, test=True,
                                        transforms=transforms, save_transforms=None,
                                        just_register=False)
    times.append([days, hours, minutes, seconds])

### Inspect results

These are the estimated run times, one entry per tile size, each given as [days, hours, minutes, seconds].

In [None]:
times

Now let's visualise the run times as a function of tile size and decide on the best tile size.

In [None]:
# Use the x tile sizes on the horizontal axis
tile_sizes = x_test

# Convert each [days, hours, minutes, seconds] estimate to total seconds
runtimes = []
for days, hours, minutes, seconds in times:
    runtime = (days * 24 * 60 * 60) + (hours * 60 * 60) + (minutes * 60) + seconds
    runtimes.append(runtime)

# Create the dot plot with lines connecting the dots
plt.figure(figsize=(8, 6))

# Plot the lines connecting the dots
plt.plot(tile_sizes, runtimes, color='red', linestyle='-', marker='o')

# Add labels and title
plt.xlabel('Tile Size [pixels]')
plt.ylabel('Runtime [sec]')
plt.title('Runtime as a Function of Tile Size')

# Add grid
plt.grid(True)

# Show plot
plt.show()
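Besides reading the optimum off the plot, the tile size with the shortest estimated run time can be picked programmatically. The following is a minimal sketch; the helper names are hypothetical and the timings are made-up placeholders, not measured results.

```python
def to_seconds(days, hours, minutes, seconds):
    """Convert a (days, hours, minutes, seconds) estimate to total seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def fastest_tile_size(tile_sizes, times):
    """Return the tile size with the smallest estimated run time, and that time in seconds."""
    runtimes = [to_seconds(*t) for t in times]
    best = min(range(len(runtimes)), key=runtimes.__getitem__)
    return tile_sizes[best], runtimes[best]

# Placeholder values for illustration only
example_sizes = [(1024, 1024), (512, 512), (256, 256)]
example_times = [[0, 1, 30, 0], [0, 1, 10, 30], [0, 2, 5, 15]]
print(fastest_tile_size(example_sizes, example_times))
```

With the notebook's own variables, the call would be `fastest_tile_size(list(zip(x_test, y_test)), times)`.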

Define the tile sizes to use in future runs here. Make sure to adjust your settings.yaml accordingly.

In [None]:
x_step = 2048
y_step = 2048