<a href="https://colab.research.google.com/github/samuelhurni/ML-Cellsegmentation-HSLU-FS24/blob/feature_Sam/Code/Sartorius_segmentation_kaggle_Sam_v001.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0.Sartorius - Cell Instance Segmentation
## Detect single neuronal cells in microscopy images

Project HSLU Master IT Digitalization & Sustainability
Module: Machine Learning and Data Science
* Samuel Hurni
* Pradanendr Sudev  
* Chakravarti Devanandini




### 0.1 General information and references

Third-party libraries used:
* PyTorch
* TQDM
* Pandas
* Numpy
* gdown
* Matplotlib

Third-party code used:
* Auxiliary functions for the metric: https://www.kaggle.com/code/theoviel/competition-metric-map-iou
* Auxiliary functions for encoding and decoding the mask: https://www.kaggle.com/code/enzou3/sartorius-mask-r-cnn



References to tutorials / code documentation:
* PyTorch documentation: https://pytorch.org/docs/stable/index.html
* PyTorch tutorial: https://www.learnpytorch.io/00_pytorch_fundamentals/
* Kaggle notebook for ideas: https://www.kaggle.com/code/enzou3/sartorius-mask-r-cnn
  * We built our own method based on `find_best_thresholds()`



  **Important:**
  **Please check the hyperparameters of this notebook (section 1.4). They let you, for example, run the project with a limited number of images or load pretrained models.**

### 0.2 About the project

Link to the competition: https://www.kaggle.com/competitions/sartorius-cell-instance-segmentation




Main objectives:

This Kaggle competition is about creating a computer program to identify and outline individual nerve cells in microscope images. These nerve cells are important for studying brain diseases like Alzheimer's and brain tumors, which are major health problems worldwide. Typically, scientists look at these cells using a microscope, but finding each cell in the images can be tough and takes a lot of time. Doing this accurately could help find new treatments for these diseases.

The challenge is that current methods aren't very good at recognizing these nerve cells, especially a kind called neuroblastoma cells, which look very different from other cells and are hard to identify with existing tools.

Sartorius, a company that supports science and medicine research, is sponsoring this competition. They want participants to develop a method that can automatically and precisely identify different types of nerve cells in images. This would be a big step forward in neurological research, making it easier for scientists to understand how diseases affect nerve cells and possibly leading to the discovery of new medications.



Dataset:

The dataset contains around xx images for training and xx images for testing. The goal is to train a model that is able to segment neuronal cells.


The ground truth for the training images consists of several metadata fields, including the masks used to train the segmentation model. Each data point has the following fields:


* _id - unique identifier for object_

* _annotation - run length encoded pixels for the identified neuronal cell_

* _width - source image width_

* _height - source image height_

* _cell_type - the cell line_

* _plate_time - time plate was created_

* _sample_date - date sample was created_

* _sample_id - sample identifier_

* _elapsed_timedelta - time since first image taken of sample_
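
The `annotation` field stores each cell mask as a run-length encoded string, and `height` and `width` give the shape needed to turn it back into an image. The following is a minimal sketch of such a decoder, not the helper taken from the referenced Kaggle notebook. The function name `rle_decode` and the example string are made up for illustration, and the sketch assumes space-separated "start length" pairs with one-indexed starts and pixels numbered row by row.

```python
import numpy as np

def rle_decode(mask_rle: str, shape: tuple) -> np.ndarray:
    """Decode a run-length encoded string into a binary mask of `shape` (height, width).

    Assumption: the string holds space-separated "start length" pairs, the starts
    are one-indexed, and pixels are numbered row by row (row-major order).
    """
    values = np.asarray(mask_rle.split(), dtype=int)
    starts = values[0::2] - 1   # convert one-indexed starts to zero-based indices
    lengths = values[1::2]
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for start, length in zip(starts, lengths):
        flat[start:start + length] = 1   # mark one run of mask pixels
    return flat.reshape(shape)

# Made-up example: two runs in a 3 x 4 image
mask = rle_decode("1 3 10 2", shape=(3, 4))
print(mask)
```

Each row in the label file describes a single cell, so an image with many cells has many such strings that each decode to one instance mask.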



## 1.Preparations: Load the dataset and install or import packages

### 1.1 Install and import third party packages / functions

In this chapter we install the third party packages that may not be preinstalled in Google Colab or on your local system

* tqdm --> progress bar
* gdown --> download files from Google Drive


In [None]:
# Install gdown:
try:
    import gdown
except ImportError:
    !pip install gdown

# Install tqdm:
try:
    import tqdm
except ImportError:
    !pip install tqdm

In [None]:
# General imports used in this file
import pandas as pd
import numpy as np
import string
import os.path
import os
from tqdm.auto import tqdm
from PIL import Image
import torch
from torchvision import transforms
from torch.utils.data import Dataset
from sklearn.preprocessing import MultiLabelBinarizer
import requests
import zipfile
from pathlib import Path
import gdown
from sklearn.metrics import fbeta_score

### 1.2 Define custom functions for this Project:

In this chapter we define custom functions that we use throughout this project:

* `show_train_time` function to show how long the computation of the model takes
* `folder_content` function to display what is inside a folder
* `check_drop_image_existence` function to drop images from the label dataset which are not in the file system
* `accuracy_fn` function for multi-label classification problems
* `plot_loss_values` for plotting the loss and accuracy to detect under- or overfitting
* `model_rating` returns the rating of the model with accuracy and score for a given dataloader
* `make_pred` makes predictions with a model based on test data
* `combine_models_predictions_2` combines the results of two models with 4 and 13 labels into a result of 17 labels
* `make_pred_combined` makes predictions for the combined approach with two models, one for the weather labels and another for the other labels



In [None]:
# Define the timing function:
from timeit import default_timer as timer
def show_train_time(start: float,
                    end: float,
                    device: torch.device = None):
  """Print and return the difference between start and end time, used to measure the training time of a PyTorch model."""
  total_time = end - start
  print(f"Train time on {device}: {total_time:.3f} seconds")
  return total_time

In [None]:
def folder_content(directory_path):
  """
  Iterate through all folders below the given path and display their content.
  Args:
    directory_path --> Path to start the iteration

  Returns:
    Nothing, prints for each directory:
      number of subdirectories in the directory
      number of files in the directory
      path of the directory
  """
  for dirpath, dirnames, filenames in os.walk(directory_path):
    print(f"There are {len(dirnames)} directories and {len(filenames)} images in '{dirpath}'.")

In [None]:
def check_drop_image_existence(label_data: pd.DataFrame, images_dir: str):
  """
  Clean the label dataframe by dropping rows whose image file does not exist.
  Note: the rows are dropped in place, so `label_data` itself is modified.
  """
  data_frame = label_data
  index_drop = []
  for index, row in tqdm(data_frame.iterrows(), total=len(data_frame), desc="Checking if file exists"):
    path_to_check = os.path.join(images_dir, row['image_name'])
    if not os.path.isfile(path_to_check):
      # File does not exist, remember the row for removal
      index_drop.append(index)

  # Drop all collected rows in a single call
  data_frame.drop(index_drop, inplace=True)
  return data_frame
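
The function above loops over all rows and modifies the label dataframe in place. As a possible alternative, the same check can be sketched with a boolean mask that leaves the input untouched. The function name `filter_existing_images`, the parameter `name_column` and the demo file names are hypothetical and only serve as illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

def filter_existing_images(label_data: pd.DataFrame, images_dir: str,
                           name_column: str = "image_name") -> pd.DataFrame:
    """Return a copy of `label_data` that keeps only rows whose image file exists."""
    exists = label_data[name_column].map(
        lambda name: (Path(images_dir) / name).is_file()
    )
    return label_data[exists].reset_index(drop=True)

# Small demonstration with a temporary folder (file names are made up)
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "a.png").touch()
    demo_labels = pd.DataFrame({"image_name": ["a.png", "missing.png"]})
    cleaned = filter_existing_images(demo_labels, tmp_dir)

print(cleaned)
```

Returning a new dataframe makes it easier to compare the row counts before and after the cleaning step.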

### 1.3 Checking for GPU and device agnostic code (CUDA (Nvidia) / Apple Silicon)

In this chapter we check whether hardware from Nvidia (CUDA framework) or Apple Silicon (M1-M3) is available and switch the device accordingly

In [None]:
#Setup device agnostic code
import torch
device="cpu"
if torch.backends.mps.is_available():
  print("Metal available with Apple Silicon GPU")
  device = "mps"
elif torch.cuda.is_available():
  device = "cuda"
  print("Cuda available with Nvidia GPU")

### 1.4 Define hyperparameters for the project:
Here we have the hyperparameters for all three models:




### 1.5 Downloading the dataset to Google Colab

In this chapter we download the dataset from a public Google Drive link to this Colab instance. Storing the files locally reduces the access time per image:

In [None]:
from pathlib import Path
import zipfile
import gdown

# Set up paths, folder names and URLs
data_path = Path("dataset")
download_path = Path("kaggledownload")
zip_path = download_path / "sartorius-cell-instance-segmentation.zip"

dataset_url = 'https://drive.google.com/uc?id=1syZoLGGeFiFErCFL_iI1VO_4k2jLEaPv&confirm=t'

# Create the dataset folder if it doesn't exist yet
if data_path.is_dir():
    print(f"{data_path} directory exists.")
else:
    print(f"Did not find {data_path} directory, creating one...")
    data_path.mkdir(parents=True, exist_ok=True)

# The download folder is needed in any case so that the archive can be written there
download_path.mkdir(parents=True, exist_ok=True)

# Download the archive only if it is not already present
if zip_path.is_file():
    print(f"{zip_path} already downloaded.")
else:
    print("Downloading dataset...")
    gdown.download(dataset_url, str(zip_path), quiet=False)




In [None]:

# Unzip data
with zipfile.ZipFile(str(download_path / "sartorius-cell-instance-segmentation.zip"), "r") as zip_ref:
    print("Unzipping train dataset data...")
    zip_ref.extractall(data_path)
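
If the download was interrupted or did not produce a valid archive, the unzip step fails with an error. A more defensive variant could check the file first. The sketch below only uses the standard library; the helper name `extract_if_valid` and the demo file names are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_if_valid(zip_path: Path, target_dir: Path) -> bool:
    """Extract `zip_path` into `target_dir` if it is a readable ZIP archive."""
    if not zipfile.is_zipfile(zip_path):
        print(f"{zip_path} is missing or not a ZIP archive, skipping extraction.")
        return False
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        bad_member = zip_ref.testzip()  # name of the first corrupt member, or None
        if bad_member is not None:
            print(f"Corrupt file in archive: {bad_member}")
            return False
        zip_ref.extractall(target_dir)
    return True

# Demonstration with a valid archive and a file that only looks like one
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    good_zip = tmp / "good.zip"
    with zipfile.ZipFile(good_zip, "w") as zf:
        zf.writestr("train.csv", "id,annotation\n")
    bad_zip = tmp / "bad.zip"
    bad_zip.write_text("<html>not an archive</html>")

    good_result = extract_if_valid(good_zip, tmp / "dataset")
    bad_result = extract_if_valid(bad_zip, tmp / "dataset")
    extracted_names = sorted(p.name for p in (tmp / "dataset").iterdir())

print(good_result, bad_result, extracted_names)
```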

## 2.Dataset preparation

In this chapter we prepare our dataset so that we can load it later into a PyTorch DataLoader:

* Load the dataset from the path
* Add the missing ".jpg" ending so that the name column matches the file names
* Split up the label classes from the label column
* Add a vectorized column `[0,1,0,0,1,....]`
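
The steps listed above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the column names `image_name` and `tags`, the class names, and the space-separated label format are all hypothetical and not taken from the actual label file.

```python
import pandas as pd

# Hypothetical label table; column names and values are assumptions
labels = pd.DataFrame({
    "image_name": ["img_0", "img_1", "img_2"],
    "tags": ["class_a class_c", "class_b", "class_a class_b"],
})

# Add the missing ".jpg" ending so the column matches the file names
labels["image_name"] = labels["image_name"] + ".jpg"

# Split the label string into a list of classes (assumes space-separated labels)
labels["label_list"] = labels["tags"].str.split()

# Build a fixed, sorted class vocabulary from all rows
classes = sorted({label for row in labels["label_list"] for label in row})

# Add a vectorized (multi-hot) column: 1 if the class is present, else 0
labels["label_vector"] = labels["label_list"].apply(
    lambda row: [1 if label in row else 0 for label in classes]
)

print(classes)
print(labels[["image_name", "label_vector"]])
```

Sorting the class vocabulary keeps the position of each class in the vector stable between runs.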