# SegMine Task - Phenaros
*Viv Inglis* \
*2024-MAY-05*

The instructions are to do the following:
- Segment the cells into single-cell masks
- Harvest the cells - extract morphological information
- Calculate the location of each cell
- Mine profiles for each cell
- Aggregate the data as either the Mean or Median of each unique well-site combo
- Cluster the data using either UMAP or TSNE

## Segmenting the Cells using Cellpose

In [None]:
!pip install cellpose

I struggled a bit with installing the dependencies for Cellpose and spent far too long trying to install the GUI, but eventually succeeded after a hefty amount of googling, lots of StackOverflow, and a bit of ChatGPT help. I had also tried to install DeepProfiler before Cellpose, but found out that the two require different versions of certain packages, so I had to uninstall and start again. All this was on a connection of roughly 8 Mbps down, since I wasn't at home on Saturday. Luckily, I returned that evening to a glorious ~200 Mbps down and finally managed to download the image data files.

In [15]:
!python3.8 -m cellpose --dir /home/vivinglis/phenaros/data/images --verbose --pretrained_model nuclei --diameter 0. --save_png --savedir /home/vivinglis/phenaros/cellpose_output/images

2024-05-05 05:13:23,623 [INFO] WRITING LOG OUTPUT TO /home/vivinglis/.cellpose/run.log
2024-05-05 05:13:23,623 [INFO] 
cellpose version: 	3.0.7 
platform:       	linux 
python version: 	3.8.10 
torch version:  	2.3.0+cu121
2024-05-05 05:13:23,623 [INFO] >>>> using CPU
2024-05-05 05:13:23,772 [INFO] >>>> running cellpose on 3080 images using chan_to_seg GRAY and chan (opt) NONE
2024-05-05 05:13:23,772 [INFO] >>>> using CPU
2024-05-05 05:13:23,772 [INFO] >> nuclei << model set to be used
2024-05-05 05:13:23,926 [INFO] >>>> model diam_mean =  17.000 (ROIs rescaled to this size during training)
2024-05-05 05:13:23,927 [INFO] >>>> estimating diameter for each image
2024-05-05 05:13:23,936 [INFO] 0%|          | 0/3080 [00:00<?, ?it/s]
2024-05-05 05:13:23,943 [INFO] channels set to [0, 0]
2024-05-05 05:13:23,943 [INFO] ~~~ ESTIMATING CELL DIAMETER(S) ~~~
2024-05-05 05:13:59,359 [INFO] estimated cell diameter(s) in 35.42 sec
2024-05-05 05:13:59,359 [INFO] >>> diameter(s) = 
2024-05-05 05:13:59

After 9h 40min, Cellpose had segmented only 9% of the images, since my computer doesn't have a GPU.
Based on that, I decided to use only 220 of the images to test the pipeline: B02*-B23*.
I created symbolic links to this subset of the images in "~/phenaros/data/images_subset".
I then split the masks generated by Cellpose in two: those for all B02*-B23* files went into "~/phenaros/cellpose_output/masks", and the additional ones that Cellpose managed to generate in the time I gave it to run went into "C_masks".
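
The subset selection could be scripted along the following lines. This is a minimal sketch, not the commands that were actually used: `in_well_range` and `link_subset` are hypothetical helpers, and the code assumes every image filename starts with a well ID such as `B02` followed by an underscore.

```python
import os
import re
from pathlib import Path

# Matches a leading well ID such as "B02" (row letter + two-digit column).
WELL_RE = re.compile(r"^([A-Z])(\d{2})_")


def in_well_range(filename, row="B", first_col=2, last_col=23):
    """Return True if the file belongs to wells <row><first_col>..<row><last_col>."""
    match = WELL_RE.match(filename)
    if match is None:
        return False
    return match.group(1) == row and first_col <= int(match.group(2)) <= last_col


def link_subset(src_dir, dst_dir, **well_range):
    """Symlink every .tiff in src_dir that falls in the well range into dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for src in sorted(src_dir.glob("*.tiff")):
        if not in_well_range(src.name, **well_range):
            continue
        dst = dst_dir / src.name
        if not dst.is_symlink() and not dst.exists():
            # Relative target, so the link survives moving the project folder.
            dst.symlink_to(os.path.relpath(src, dst_dir))
        linked.append(dst.name)
    return linked
```

Calling `link_subset(Path.home() / "phenaros/data/images", Path.home() / "phenaros/data/images_subset")` would use the default B02 to B23 range; `Path` does not expand `~` on its own, hence `Path.home()`.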

I also now realise that I used the command-line example from the manual for nuclear (grayscale) data, where the diameter is estimated automatically. In hindsight, I would have followed this example to run it in the notebook instead:
https://cellpose.readthedocs.io/en/latest/notebook.html

**Note**: I'm not sure why I used the nuclei model, probably because it was late and I was tired. If I could run it again, I would use cyto3 instead, since it's the super-generalist model. Should've read the manual more closely!
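
For reference, a notebook-side run with cyto3 might look like the sketch below. It assumes the Cellpose 3.x Python API (`models.Cellpose` and its `eval` method, in line with the 3.0.7 install shown in the log above); `segment_with_cyto3` is a hypothetical wrapper name, and this code was not run as part of the task.

```python
def segment_with_cyto3(image_paths, use_gpu=False):
    """Segment grayscale images with Cellpose's cyto3 model (Cellpose 3.x API)."""
    # Imported lazily so the function can be defined without Cellpose installed.
    from cellpose import io, models

    model = models.Cellpose(gpu=use_gpu, model_type="cyto3")
    images = [io.imread(str(path)) for path in image_paths]
    # diameter=None asks Cellpose to estimate the diameter for each image;
    # channels=[0, 0] means "segment grayscale, no second channel".
    masks, flows, styles, diams = model.eval(
        images, diameter=None, channels=[0, 0]
    )
    return masks, diams
```

On a CPU-only machine this would still be slow, so it would make sense to pass only the subset of image paths rather than the full directory.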

In [70]:
!tree -d ~/phenaros/
!tree ~/phenaros/data/images_subset

/home/vivinglis/phenaros/
├── cellpose_output
│   ├── C_masks
│   └── masks
├── data
│   ├── images
│   └── images_subset
└── deepprofiler_output
    ├── inputs
    │   ├── config
    │   ├── images
    │   ├── locations
    │   ├── metadata
    │   └── outlines
    └── outputs
        ├── compressed
        │   └── images
        ├── intensities
        ├── results
        │   ├── checkpoint
        │   ├── features
        │   ├── logs
        │   └── summaries
        └── single-cells

23 directories
/home/vivinglis/phenaros/data/images_subset
├── B02_s1_x0_y0_Fluorescence_405_nm_Ex.tiff -> ../images/B02_s1_x0_y0_Fluorescence_405_nm_Ex.tiff

## Calculate Location using CellProfiler

Since DeepProfiler requires the cell locations, I would then run CellProfiler and store them as directed by the DeepProfiler manual.

https://cytomining.github.io/DeepProfiler-handbook/docs/04-metadata.html#single-cell-locations-file
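
If CellProfiler turns out to be more than is needed for locations alone, the centroids could also be computed directly from the Cellpose label masks. The following is a sketch under assumptions, not what was run here: it assumes each mask is a 2-D integer array with 0 as background and one positive label per cell, and that DeepProfiler expects the `Nuclei_Location_Center_X` and `Nuclei_Location_Center_Y` columns as I read them from the handbook page above (worth double-checking there). The function names are hypothetical.

```python
import csv

import numpy as np


def mask_centroids(mask):
    """Return one (x, y) centroid per labelled object, in ascending label order."""
    mask = np.asarray(mask)
    centroids = []
    for label in np.unique(mask):
        if label == 0:  # 0 is background in a label mask
            continue
        ys, xs = np.nonzero(mask == label)
        centroids.append((float(xs.mean()), float(ys.mean())))
    return centroids


def write_locations_csv(centroids, csv_path):
    """Write centroids to a CSV with the assumed DeepProfiler column names."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Nuclei_Location_Center_X", "Nuclei_Location_Center_Y"])
        writer.writerows(centroids)
```

The mask itself would first need to be loaded from the `*_cp_masks.png` file into an array, for example with an image I/O library of choice.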


### Extract Features with CellProfiler?

I didn't get a chance to run this, and couldn't really establish it from a brief read of the manual, but I assume that CellProfiler may be able to extract features from the Cell Painting channels, such as:

1. DNA (Nucleus)
2. RNA (Nucleoli, cytoplasmic RNA)
3. ER (Endoplasmic reticulum)
4. AGP (F-actin cytoskeleton, Golgi, plasma membrane)
5. Mito (Mitochondria)

These would later be used to run the Cell Painting CNN model in DeepProfiler.

## Profiles using DeepProfiler

In [None]:
# Install DeepProfiler (cloned into the home directory)
%cd ~
!git clone https://github.com/broadinstitute/DeepProfiler.git
# Use %cd, not !cd: a `!cd` runs in its own subshell and does not persist
%cd ~/DeepProfiler/
!pip install -e .

In [42]:
%cd ~/DeepProfiler
!python3.8 deepprofiler --root=/home/vivinglis/phenaros/deepprofiler_output setup

/home/vivinglis/DeepProfiler
2024-05-05 15:39:31.587106: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-05-05 15:39:31.587168: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Instructions for updating:
non-resource variables are not supported in the long term
Instructions for updating:
non-resource variables are not supported in the long term
Directory exists:  /home/vivinglis/phenaros/deepprofiler_output
Directory exists:  /home/vivinglis/phenaros/deepprofiler_output/inputs/locations/
Directory exists:  /home/vivinglis/phenaros/deepprofiler_output/inputs/config/
Directory exists:  /home/vivinglis/phenaros/deepprofiler_output/inputs/images/
Directory exists:  /home/vivinglis/phenaros/deepprofiler_output/inputs/metadata/
Directory exists:  /home/viv

In [57]:
%cd ~/phenaros/
!tree deepprofiler_output

/home/vivinglis/phenaros
deepprofiler_output
├── inputs
│   ├── config
│   ├── images
│   ├── locations
│   └── metadata
└── outputs
    ├── compressed
    │   └── images
    ├── intensities
    ├── results
    │   ├── checkpoint
    │   ├── features
    │   ├── logs
    │   └── summaries
    └── single-cells

15 directories, 0 files


### Preparing Metadata File

Since the cells have already been masked with Cellpose, the next step would be to prepare the metadata file for DeepProfiler and place it in the inputs/metadata/ directory.

*On the topic of masks* - DeepProfiler recommends not using them with the Cell Painting CNN model, so I would want to investigate that further to figure out what that means in practice.

The metadata file would be called "index.csv" and would contain the following columns:
Metadata_Plate, Metadata_Well, Metadata_Site, Channel_Name, Treatment and Replicate.

The well and site can be extracted from the image file names, e.g. for image file "B23_s2_x1_y0_Fluorescence_730_nm_Ex.tiff" the well is B23 and the site is s2. From the file names, it seems as if there are 2 sites per well.
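
A small parser for this naming scheme could look like the sketch below. The pattern is inferred only from the two filenames quoted in this notebook, so it is an assumption that every file follows it; `parse_image_name` is a hypothetical helper, and the meaning of the `x`/`y` fields is not stated in the source, so they are returned as plain integers.

```python
import re

# Pattern inferred from names like "B23_s2_x1_y0_Fluorescence_730_nm_Ex.tiff".
FILENAME_RE = re.compile(
    r"^(?P<well>[A-Z]\d{2})_s(?P<site>\d+)_x(?P<x>\d+)_y(?P<y>\d+)"
    r"_Fluorescence_(?P<wavelength>\d+)_nm_Ex\.tiff$"
)


def parse_image_name(filename):
    """Split an image filename into well, site, x, y and excitation wavelength."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename!r}")
    fields = match.groupdict()
    return {
        "well": fields["well"],
        "site": int(fields["site"]),
        "x": int(fields["x"]),
        "y": int(fields["y"]),
        "wavelength_nm": int(fields["wavelength"]),
    }
```

Raising on an unexpected name, rather than skipping it silently, makes it obvious if some files do not follow the assumed scheme.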

However, the filenames for the images don't indicate which plate they belong to. According to DeepProfiler, it is expected that the images are in a folder named after the plate, but here they are just in an "images" directory, e.g.:
data/images/B02_s1_x0_y0_Fluorescence_405_nm_Ex.tiff

According to the metadata file (provided both in parquet and csv format on the GitHub page), the B02 well may belong to 1 of 4 different plates. Assuming the image files have been stored in deepprofiler_output/inputs/images, the metadata file would look something like this:

| Metadata_Plate | Metadata_Well | Metadata_Site | Channel_Name                                              | Treatment | Replicate |
|----------------|---------------|---------------|-----------------------------------------------------------|-----------|-----------|
| PB000103       | B02           | 1             | PB000100/B02_s1_x0_y0_Fluorescence_405_nm_Ex_cp_masks.png | negcon    | 1         |

In the metadata there seem to be two treatment types, poscon and negcon, with two compounds for each type.
If the compounds are considered replicates of their treatment type, then it could look something like this:

| Treatment | Batch_ID           | Replicate |
|-----------|--------------------|-----------|
| poscon    | Etoposide          | 1         |
| poscon    | Paclitaxel         | 2         |
| negcon    | Dimethyl Sulfoxide | 1         |
| negcon    | D-Sorbitol         | 2         |
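
A rough sketch of how such an index file could be assembled is shown below. Several things here are assumptions rather than facts about the dataset: the plate ID and the well-to-treatment mapping would have to come from the provided metadata file (they are plain arguments here), and a single `Channel_Name` column is kept as in the example row above, although DeepProfiler may well expect one column per channel. `build_index_rows` and `rows_to_csv` are hypothetical helpers.

```python
import csv
import io

COLUMNS = [
    "Metadata_Plate", "Metadata_Well", "Metadata_Site",
    "Channel_Name", "Treatment", "Replicate",
]


def build_index_rows(filenames, plate, treatment_by_well):
    """Build one index row per image file.

    treatment_by_well maps a well ID to a (treatment, replicate) tuple.
    """
    rows = []
    for name in sorted(filenames):
        well, site_token = name.split("_")[:2]  # e.g. "B02", "s1"
        treatment, replicate = treatment_by_well[well]
        rows.append({
            "Metadata_Plate": plate,
            "Metadata_Well": well,
            "Metadata_Site": int(site_token.lstrip("s")),
            "Channel_Name": f"{plate}/{name}",  # path relative to inputs/images
            "Treatment": treatment,
            "Replicate": replicate,
        })
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text, ready to be saved as index.csv."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the plate and treatment lookup as explicit arguments avoids guessing which of the four candidate plates a well belongs to.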

### Configuration File

The next step would be to prepare the configuration file for DeepProfiler according to the page below:

https://cytomining.github.io/DeepProfiler-handbook/docs/05-config.html#configuration-file-organization

And finally to run profiling using the CNN Cell Painting model:

https://cytomining.github.io/DeepProfiler-handbook/docs/06-profiling.html#profiling-with-cell-painting-cnn-model

## Aggregate and Visualise

Following the Downstream Analysis section of the DeepProfiler handbook, I would aggregate the data and visualise it using UMAP.

https://cytomining.github.io/DeepProfiler-handbook/docs/08-process.html#downstream-analysis
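
As a sketch of what this step could look like with pandas: the code below assumes the single-cell features have already been loaded into a DataFrame with `Metadata_Well` and `Metadata_Site` columns and that every other column is a numeric feature (the `feature_0` name in the usage note is a placeholder, not DeepProfiler's actual naming). The embedding function additionally assumes the `umap-learn` package is installed. Neither function was run on the real data.

```python
import pandas as pd

GROUP_COLS = ["Metadata_Well", "Metadata_Site"]


def aggregate_profiles(cells, how="median"):
    """Collapse single-cell rows to one mean or median profile per well-site."""
    if how not in ("mean", "median"):
        raise ValueError("how must be 'mean' or 'median'")
    # Every column that is not a grouping key is treated as a numeric feature.
    feature_cols = [col for col in cells.columns if col not in GROUP_COLS]
    return cells.groupby(GROUP_COLS)[feature_cols].agg(how).reset_index()


def embed_umap(profiles, random_state=0):
    """Project aggregated profiles to 2-D with UMAP (needs umap-learn)."""
    import umap  # imported lazily; provided by the umap-learn package

    features = profiles.drop(columns=GROUP_COLS).to_numpy()
    reducer = umap.UMAP(n_components=2, random_state=random_state)
    return reducer.fit_transform(features)
```

For example, `aggregate_profiles(cells, how="median")` on a frame with a `feature_0` column returns one row per well-site combination. With only a few dozen aggregated profiles, UMAP's `n_neighbors` setting would probably need to be lowered from its default.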


## Conclusions

Reading back through everything I did, I've noticed a lot of mistakes that I would want to go back and fix or investigate further.
For example, I would set up an individual Docker container for each analysis tool to avoid conflicting dependencies, run the cyto3 model during the Cellpose step, figure out what to do with the masks during the Cell Painting CNN step, and read all the manuals much more closely.

If I had time, I would overhaul it all and start from the beginning, now that I have a better understanding of each of the individual steps.



... Oh and I would become a PC gamer so that I would have had access to an NVIDIA graphics card.