# ***DESI Human Glioma Preprocessed Dataset Tissue Spectra Metadata export***

This notebook builds a metadata CSV listing every tissue spectrum in the preprocessed DESI Human Glioma (DHG) dataset, for use in later analysis.

### ***Import packages***

Before we begin, let's import all the necessary packages for this notebook:

In [1]:
import os
import numpy as np
import pandas as pd
from skimage import filters
from pyimzml.ImzMLParser import ImzMLParser, getionimage
from tqdm.notebook import tqdm

%matplotlib inline

### ***Constants definitions***

Next, let's define some constant variables for this notebook:

In [2]:
# Define folder that contains the preprocessed DHG dataset
DHG_IN_PATH = "/sise/assafzar-group/assafzar/Leor/DHG/Preprocessed"
# Define file that contains clinical state annotations
LABELS_PATH = "/sise/assafzar-group/assafzar/Leor/DHG/Clinical_state_anotations.csv"
# Define file to export
DHG_OUT_PATH = "/sise/assafzar-group/assafzar/Leor/DHG/Preprocessed/Metadata.csv"

### ***Reading MSI clinical state annotations***

Next, let's read the clinical state annotations for each MSI:

In [3]:
# Read clinical state annotations csv
labels_df = pd.read_csv(LABELS_PATH)
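
The export loop later in this notebook reads `file_name`, `sample_type`, `who_grade`, and `histology` from each row of `labels_df`. A minimal sketch of a sanity check for those columns follows. It uses a small made-up table in place of the real CSV; the helper name `check_label_columns` and all sample values are hypothetical.

```python
import pandas as pd

# Columns the export loop reads from each row of the annotations table
REQUIRED_COLUMNS = ["file_name", "sample_type", "who_grade", "histology"]


def check_label_columns(labels: pd.DataFrame) -> list:
    """Return the required columns that are missing from the table."""
    return [col for col in REQUIRED_COLUMNS if col not in labels.columns]


# Made-up rows standing in for the real annotations CSV (values are hypothetical)
example_labels = pd.DataFrame(
    {
        "file_name": ["sample_a", "sample_b"],
        "sample_type": ["type_1", "type_2"],
        "who_grade": [2, 4],
        "histology": ["hist_1", "hist_2"],
    }
)

# A complete table reports no missing columns
missing = check_label_columns(example_labels)
# Dropping a column makes the check report it
missing_after_drop = check_label_columns(example_labels.drop(columns=["histology"]))
print(missing, missing_after_drop)
```

Running the same check on `labels_df` right after `pd.read_csv` would surface a renamed or missing column before the long MSI loop starts.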

### ***Export all tissue spectra metadata from all MSI***

Next, for each MSI, let's separate tissue pixels from background and export the metadata of every tissue spectrum:

In [4]:
# Create empty list to store loop dataframes
dfs = []

# Loop over each MSI
for index, msi_row in tqdm(labels_df.iterrows(), total=labels_df.shape[0], desc="MSI Loop"):
  # Parse the MSI file 
  with ImzMLParser(os.path.join(DHG_IN_PATH, f"{msi_row.file_name}.imzML")) as reader:
    # Get local TIC image of msi in mz region [600, 900]
    local_tic_img = getionimage(reader, 750, tol=150)

    # Threshold image to separate tissue spectra from background
    # (the threshold is computed on the smoothed image and applied to the raw image)
    smooth = filters.gaussian(local_tic_img, sigma=1.5)
    thresh_mean = filters.threshold_mean(smooth)
    thresh_img = local_tic_img > thresh_mean

    # Get tissue coordinates as (row, column) pairs
    tissue_coordinates = np.argwhere(thresh_img)
    # Create dataframe of tissue coordinates
    loop_df = pd.DataFrame({"y": tissue_coordinates[:, 0], "x": tissue_coordinates[:, 1]})
    # Add sample type to each row
    loop_df["sample_type"] = msi_row.sample_type
    # Add file name to each row
    loop_df["file_name"] = msi_row.file_name
    # Add who_grade to each row
    loop_df["who_grade"] = msi_row.who_grade
    # Add histology to each row
    loop_df["histology"] = msi_row.histology
    # Add spectrum index to each row (reader.coordinates holds 1-based (x, y, z) tuples)
    idxs = {(x[0], x[1]): idx for idx, x in enumerate(reader.coordinates)}
    loop_df["idx"] = loop_df.apply(lambda row: idxs[(row["x"] + 1, row["y"] + 1)], axis=1)
    
    # Append loop dataframe to dataframe list
    dfs.append(loop_df)

# Create one dataframe from list
df = pd.concat(dfs, ignore_index=True)
# Change type to str for one-hot encoding
df['who_grade'] = df['who_grade'].astype(str)
# One-hot encode categorical columns and save csv
pd.get_dummies(df, columns=["who_grade","histology"]).to_csv(DHG_OUT_PATH, index=False)

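
The `idx` lookup in the loop above calls a Python lambda once per tissue pixel. A minimal sketch of a vectorized alternative using `DataFrame.merge` follows, with made-up stand-ins for `reader.coordinates` and `loop_df`. The names `coordinates`, `tissue_df`, and `lookup` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for reader.coordinates: 1-based (x, y, z) tuples in acquisition order
coordinates = [(1, 1, 1), (2, 1, 1), (3, 1, 1), (1, 2, 1), (3, 2, 1)]

# Stand-in for loop_df: 0-based (row, column) positions of tissue pixels
tissue_df = pd.DataFrame({"y": [0, 1, 1], "x": [1, 0, 2]})

# Build a table that maps 0-based (x, y) positions to the spectrum index
coords = np.asarray(coordinates)
lookup = pd.DataFrame(
    {
        "x": coords[:, 0] - 1,
        "y": coords[:, 1] - 1,
        "idx": np.arange(len(coords)),
    }
)

# A left join keeps the row order of tissue_df and attaches idx to each pixel
merged = tissue_df.merge(lookup, on=["x", "y"], how="left")
print(merged)
```

Unlike the dictionary lookup, a left join leaves `NaN` in `idx` for a tissue pixel that has no recorded spectrum instead of raising a `KeyError`, so a check for missing values may be worth adding before the export.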