# **Sartorius - Cell Type Image Classification - TensorFlow Approach**

# Summary

- This notebook addresses the task of classifying the cell type (astro, cort or shsy5y), so that a separate
  segmentation network can be trained for every single cell type. This could lead to better results, but does not have to.
  
- The architecture is a ResNet50 with a small number of additional standard layers. It is all done via TF/Keras.

- The images used are those from \train and \train_semi_supervised. These roughly 2500 images with a shape of (520, 704)
  are split into tiles with a shape of (128, 128); the small remaining parts of the original images are discarded.
  The tiles are then resized to (224, 224) and expanded to RGB (via the generator),
  because that is the input size the ImageNet-pretrained ResNet was trained with.
  Without tiling, resizing the full images to (224, 224) also reached a high validation accuracy, but only late in training:
  the accuracy sat on a plateau at about 0.3 (random guessing with 3 classes) and only started to rise steadily towards 0.95 after 30 epochs.
  With the tiles approach the training was faster and more stable; see the k-fold cross validation below with a validation accuracy of 0.99.
  The k-fold validation here builds the folds by image id (called "stratified" in the code), so all tiles of one image end up
  either in the train set or in the validation set, never in both. This avoids an overly optimistic validation score.
  
- Shoutout to Yoshi_K and his notebook (https://www.kaggle.com/yoshikuwano/classified-by-cell-types-before-segmentation-1-2),
  who achieved similar results with his PyTorch approach, and thanks for the exchange about this problem.
  I could not reproduce his results with the same topology and settings, so I had to use a much smaller learning rate;
  with higher ones the accuracy fluctuated uselessly between 0.3 and 0.95.
  
- In this notebook the required functions are defined first; the Training section then calls them
  to run the training loop. The final weights are stored in the output of the notebook.
  
- I would be very pleased if the notebook is at least a bit useful for some people, even if it is only the k-fold plot,
  and I would be very thankful for any discussion, questions, tips, pointers to mistakes, other approaches and so on to get better at deep learning.
  
  Much Love from Germany, Andreas!




# Functions

## *Global Variables*

In [None]:
batch_size    = 32
img_len_x     = 224
img_len_y     = 224 
input_shape = (img_len_x,img_len_y,1)
target_size = input_shape[0:2]
num_classes = 3

## *Imports*

In [None]:
### Import packages ###

#  math operations
import numpy as np
import pandas as pd
import random

# image processing
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.image as mpimg
%matplotlib inline
import cv2
from PIL import Image

# deep learning frameworks
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.densenet import DenseNet121
# Activation is needed by define_model below; the duplicate Dense import was removed
from tensorflow.keras.layers import Activation, Dense, AveragePooling2D, MaxPooling2D, BatchNormalization, Flatten, Input, Conv2D, Dropout, ZeroPadding2D, Convolution2D
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
import tensorflow as tf
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.callbacks import LearningRateScheduler
import gc

# splitting data frame
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

# path management
import os
import shutil

# display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is no longer accepted by recent pandas)

# measuring time
import time
from tqdm import tqdm

## *Helper Functions*

In [None]:
### print(color.BOLD + color.RED + "" + color.END)

class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'


# color schemes for the k-fold plots; extend the lists if k > 5 should be used
list_colors_red  = ['darkviolet','mediumpurple','violet','fuchsia','red']
list_colors_blue = ['lightgreen','palegreen','mediumspringgreen','seagreen','darkolivegreen']

## *Data Integration and Preprocessing*

- load train df

In [None]:
def load_train_csv(path = "../input/sartorius-cell-instance-segmentation/train.csv", print_df = True):
 
  train_df = pd.read_csv(path)
  
  if print_df: 
    print("Csv: ", path ,"\n")
    print(f'Training Set Shape: {train_df.shape} - {train_df["id"].nunique()} Images \n')
    print("Head of df:")
    print(train_df.head(10))
    print("")
    print("different cell types:")
    print(train_df["cell_type"].drop_duplicates().reset_index(drop=True),"\n")

  else:
    return train_df  


- extract .csv from train semi supervised images 

In [None]:

def get_train_semi_supervised_csv(path = "../input/sartorius-cell-instance-segmentation/train_semi_supervised", print_df = True, store_on_drive = False):


  # source of images (uses the path argument instead of a second hard-coded path)
  train_semi_supervised_images = os.listdir(path)
 
  # get an empty dataframe that keeps only the columns of train.csv
  train_df = load_train_csv(print_df = False)
  train_df = train_df[0:0]

  # function to get information for every file
  def get_cols_from_filename(filename):
      
      image_id = filename.split('.')[0]
      cell_type = filename.split('[')[0]
      filename_split = filename.split('_')
      plate_time = filename_split[-3]
      sample_date = filename_split[-4]
      sample_id = '_'.join(filename_split[:3]) + '_' + '_'.join(filename_split[-2:]).split('.')[0]
      
      return image_id, cell_type, plate_time, sample_date, sample_id

  # getting information of images
  # (rows are collected in a list first, because DataFrame.append was removed in pandas 2.0)
  samples = []
  for filename in train_semi_supervised_images:

      image_id, cell_type, plate_time, sample_date, sample_id = get_cols_from_filename(filename)
      samples.append({
          'id': image_id,
          'annotation': np.nan,
          'width': 704,
          'height': 520,
          'cell_type': cell_type,
          'plate_time': plate_time,
          'sample_date': sample_date,
          'sample_id': sample_id
      })

  train_df = pd.concat([train_df, pd.DataFrame(samples)], ignore_index=True)
    
  train_df['cell_type'] = train_df['cell_type'].str.rstrip('s')

  train_df.to_csv('./train_semi_supervised.csv', index= False)

  if store_on_drive == True:
    train_df.to_csv('./train_semi_supervised.csv', index= False)

  if print_df:
    print(f'Training Set Shape: {train_df.shape} - {train_df["id"].nunique()} Images \n')
    print("Head of df:")
    print(train_df.head(10))
    print("")
    print("different cell types:")
    print(train_df["cell_type"].drop_duplicates().reset_index(drop=True))

  else:
    return train_df
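
The cell type is read from the part of the file name in front of the first `[`, and a trailing `s` is stripped afterwards. A minimal sketch of that step, assuming a made-up file name (the real names in train_semi_supervised may look different) and a hypothetical helper `parse_cell_type`:

```python
# Hypothetical file name, chosen only to illustrate the parsing.
filename = "astros[hipp]_plateA_well1_2020-01-01_12h00m_x1_y2.png"


def parse_cell_type(name):
    # everything in front of the first '[' is the raw cell type prefix
    raw = name.split('[')[0]
    # rstrip('s') removes all trailing 's' characters, e.g. 'astros' -> 'astro'
    return raw.rstrip('s')


image_id = filename.split('.')[0]      # file name without the extension
cell_type = parse_cell_type(filename)

print(image_id)
print(cell_type)
```

Note that `rstrip('s')` would also shorten any other label ending in `s`, so it only fits as long as no real class name ends with that letter.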



- load train.csv and the just created train_semi_supervised.csv to get the full dataframe

In [None]:
def load_df(return_df = True): 
  
  # image path:
  path_train = "../input/sartorius-cell-instance-segmentation/train"
  path_train_semi_supervised = "../input/sartorius-cell-instance-segmentation/train_semi_supervised"


  # train.csv -> keep only one row per image (train.csv has one row per annotation)
  df_1 = pd.read_csv("../input/sartorius-cell-instance-segmentation/train.csv")
  df_1 = df_1.drop(columns=['annotation'])
  df_1 = df_1.drop_duplicates().reset_index(drop=True)
  df_1["csv"] = "train.csv"
  df_1["image_path"] = df_1["id"].apply(lambda x: path_train + "/" + x + ".png")
  print("\nlength of train:", len(df_1))

   
  # train_semi_supervised.csv -> already only one row per image
  df_2 = pd.read_csv("./train_semi_supervised.csv")
  df_2 = df_2.drop(columns=['annotation'])
  df_2["csv"] = "train_semi_supervised.csv"
  df_2["image_path"] = df_2["id"].apply(lambda x: path_train_semi_supervised + "/" + x + ".png")
  print("\nlength of train semi supervised:", len(df_2))

  # combine both
  df_train = pd.concat([df_1, df_2])
  df_train = df_train.reset_index(drop=True)

  # drop unnecessary cols
  df_train = df_train.drop(columns=["sample_date", "height", "width", "sample_id", "elapsed_timedelta", "plate_time"])

  # checks
  print("\nlength of train df:", len(df_train),"\n")
  print(color.BOLD  + "glimpse at df:" + color.END)
  print(df_train.head(3))
  print(df_train.tail(3))
  print("\nDistribution of cell types:")
  print(df_train['cell_type'].value_counts())

  if return_df == True:
    return df_train
  

- Collection of data preprocessing functions

In [None]:
###  several functions for image preprocessing


def image_to_array(image_path, rgb = 0):
  """rgb = 0: reads as grayscale; rgb = 1: reads as 3-channel color (cv2 returns BGR order)"""
  image = cv2.imread(image_path, rgb)
  return image


def store_as_png(image, save_path):
    img_save_arr = Image.fromarray(image) 
    img_save_arr.save(save_path)


def cropp_image(image, target_size=target_size):
  image = image[0:target_size[0], 0:target_size[1]]
  return image


def contrast_enhancing_right(image, quantil = 0.95):
    'scale the image by the given quantile value and clip everything above it to 1.0'

    quantil = np.quantile(image, quantil)
    image_c = (image / quantil * 1.0).astype(np.float32)
    image_c[image_c > 1.0] = 1.0
    #image_c = image_c[..., np.newaxis]
    
    return image_c

 
def contrast_enhancing_left_and_right(image, quantil = 0.975):
    'subtract the lower quantile (clipping at 0), then scale by the upper quantile and clip to 1.0'

    quantil_left = np.quantile(image[image>0.], 1 - quantil)
    image_c = image - quantil_left
    image_c[image_c < 0.] = 0.

    quantil_right = np.quantile(image_c, quantil)
    image_c = (image_c / quantil_right * 1.0).astype(np.float32)
    image_c[image_c > 1.0] = 1.0

    return image_c
        
    
def cv2_hist_eq(image):
    'histogram equalization with the prebuilt cv2 function; assumes a float image in [0, 1]'
    image = (image * 255).astype(np.uint8)

    image_c = cv2.equalizeHist(image)
    image_c = (image_c / 255.).astype(np.float32)
    image_c[image_c > 1.0] = 1.0
    
    image_c = image_c[..., np.newaxis]

    return image_c
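
The quantile-based contrast functions above divide the image by a high quantile and clip everything above it to 1.0. A minimal sketch of that idea on synthetic values; `clip_to_quantile` is a hypothetical helper name and the input data is made up:

```python
import numpy as np


def clip_to_quantile(image, q=0.95):
    # hypothetical helper that mirrors the idea of contrast_enhancing_right
    upper = np.quantile(image, q)                # value below which a share q of the pixels lies
    scaled = (image / upper).astype(np.float32)  # the quantile value maps to 1.0
    return np.clip(scaled, 0.0, 1.0)             # brighter pixels saturate at 1.0


# synthetic "image" with the values 0..100
image = np.arange(101, dtype=np.float32).reshape(1, 101)
out = clip_to_quantile(image)

print(out.min(), out.max())
print(int((out == 1.0).sum()), "pixels are saturated")
```

The brightest few percent of the pixels lose their detail, while the remaining range is stretched over [0, 1].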

- Making tiles and a visual check: if tiling is True, tiles are created with the given "len_tile" without further resizing (the generator does that by itself);
  if tiling is False, the images are just loaded and resized to the given target size, see the "Global Variables" section above.
  All files are then stored in the output folder.

In [None]:
# preprocess data and store it as png files in the folder given by path
# get new train df

def making_tiles(df, path = './', len_tile = 64, tiling = True):
    
    
  # getting paths for the images in train folder
  image_paths = list(df['image_path'])
  ids = list(df['id'])
  #print(image_paths[0:3])
      
  # list of images loaded as array
  images_array = [image_to_array(image_path) for image_path in image_paths]
  print("image shape: " , images_array[0].shape)
 
  save_paths = []
  ids_processed = []
  shape_x = 520 # x is axis=0 (vertical, down to bottom)
  shape_y = 704 # y is axis=1 (horizontal, left to right)

  n_tiles = (shape_x // len_tile) * ((shape_y // len_tile))
  n_tiles_x = (shape_x // len_tile)
  n_tiles_y = ((shape_y // len_tile))

  if tiling == True:
    print("Number of tiles: ", n_tiles, " - tile shape: ", "(",len_tile,",",len_tile,")", "- sum of pixels: ", (len_tile * len_tile* n_tiles) ,  " - rest pix : ", shape_x*shape_y - len_tile * len_tile * n_tiles)
    print("loss in %: ", round((shape_x*shape_y - len_tile * len_tile * n_tiles)/(shape_x*shape_y),2)*100,"%")
          
    # store tiles of images
    print("\n")
    for i in range(len(image_paths)): #images
      
      img = images_array[i]
      img = img[0:shape_x, 0:shape_y]
      
      id = ids[i]
      count_tiles_per_id = 0
      
      for j in range(0, n_tiles_x): 
        for k in range(0, n_tiles_y):

          count_tiles_per_id += 1
          save_path = path + id + "_" + str(count_tiles_per_id).zfill(2) + str(".png")
          save_paths.append(save_path)
          ids_processed.append(id)

          img_tile =  img[ (j * len_tile) : ((j+1) * len_tile), (k * len_tile) : ((k+1) * len_tile)]
          store_as_png(img_tile, save_path)


  if tiling == False:
    print("No tiling, keep images and resize it to target size")
    for i in range(len(image_paths)): #images
      
      img = images_array[i]
      img = cv2.resize(img, target_size)
      
      id = ids[i]
    
      save_path = path + id + str(".png")
      save_paths.append(save_path)
      ids_processed.append(id)
      store_as_png(img, save_path)


  # visualize example
  if tiling == False:
    id = 0
    img_original = cv2.imread(save_paths[id], 0)
    f, ax = plt.subplots(nrows = 1, ncols = 1, figsize=(12, 12))
    ax.imshow(img_original, cmap="gray")
    ax.axis('off')
    plt.title(f'img original - with resized shape: {img_original.shape}' , fontsize = 16)
    plt.show()

  if tiling == True:
    id = 0
    img_original = cv2.imread(image_paths[id], 0)
    f, ax = plt.subplots(nrows = 1, ncols = 1, figsize=(12, 12))
    ax.imshow(img_original, cmap="gray")
    ax.axis('off')
    plt.title(f'img original - Shape: {img_original.shape} - and tiles: {img_tile .shape}' , fontsize = 16)
    plt.show()

    c = 0
    f, ax = plt.subplots(nrows = n_tiles_x, ncols = n_tiles_y, figsize=(12, 8))
    f.subplots_adjust(hspace=0.01)
    for j in range(0, n_tiles_x): 
      for k in range(0, n_tiles_y):
          c += 1
          save_path = path + ids[id] + "_" + str(c).zfill(2) + str(".png")
          tile = cv2.imread(save_path, 0)
          ax[j][k].imshow(tile, cmap="gray", vmin=0, vmax=255)
          ax[j][k].axis('off')
    plt.show()

  # make df processed
  frame = {'id':ids_processed, 'image_path': save_paths}
  df_processed = pd.DataFrame(frame)
  df_processed = df_processed.merge(df[["id","cell_type"]], how = 'left', left_on = "id", right_on = "id")

  print("\nlength of train df:", len(df_processed),"\n")
  print(color.BOLD  + "glimpse at df processed:" + color.END)
  print(df_processed.head(5))

  # free ram
  del images_array
  del save_paths
  del image_paths
  gc.collect()

  
  return df_processed
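
The tile grid and the share of discarded pixels follow directly from integer division of the image shape by the tile length. A small sketch for the (520, 704) images and a tile length of 128, using an empty array instead of a real image; `split_into_tiles` is a hypothetical helper, not the function used above:

```python
import numpy as np


def split_into_tiles(image, len_tile):
    # number of complete tiles per axis; the remainder at the border is dropped
    n_rows = image.shape[0] // len_tile
    n_cols = image.shape[1] // len_tile
    tiles = []
    for j in range(n_rows):
        for k in range(n_cols):
            tiles.append(image[j * len_tile:(j + 1) * len_tile,
                               k * len_tile:(k + 1) * len_tile])
    return tiles


image = np.zeros((520, 704), dtype=np.uint8)   # placeholder with the image shape
tiles = split_into_tiles(image, 128)

covered = len(tiles) * 128 * 128
discarded = image.size - covered
print("tiles:", len(tiles), "- tile shape:", tiles[0].shape)
print("discarded pixels:", discarded, "- share:", round(discarded / image.size, 3))
```

With these numbers the grid is 4 x 5 tiles and roughly 10 % of the pixels at the right and bottom border are not used.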


- Resampling the train df to address imbalanced data ("DOWN", "UP" or None)

In [None]:
# RESAMPLING

def resampling(df_train, resampling_on = "DOWN", label = "cell_type"):


  distribution_df = df_train[label].value_counts().rename_axis(label).reset_index(name='counts')
  print('Distribution target without Resampling:\n', distribution_df)

  # new dataframe for resampling
  df_res = df_train.copy()
  counts = distribution_df['counts']
  y      = df_res[label] 


  if resampling_on == 'DOWN':

    min_counts = min(counts)

    for ind in range(0,num_classes):

      if (((counts.iloc[ind]) - min_counts)/(counts.iloc[ind]))>0.01:
        class_i = distribution_df[label].iloc[ind]
        list_indices_i   = list(df_res[y == class_i].index)
        list_indices_i_sample = random.sample(list_indices_i,counts[ind]-int(min_counts*1.001))
        list_indices_i_sample = list(list_indices_i_sample)  
        list_indices_i_sample = sorted(list_indices_i_sample)
        df_res = df_res.drop(list_indices_i_sample, axis = 0)

    df_res = df_res.reset_index(drop=True)
        

  if resampling_on == 'UP':

    max_counts = max(counts)

    for ind in range(0,num_classes):

      if ((max_counts - counts.iloc[ind])/(max_counts))>0.05:
        class_i = distribution_df[label].iloc[ind]          
        list_indices_i   = list(df_res[y == class_i].index)
        list_indices_i_sample = random.choices(list_indices_i, weights = None, k = (max_counts-counts[ind]))
        list_indices_i_sample = list(list_indices_i_sample)
        df_res_up = df_res.iloc[list_indices_i_sample]
        df_res = pd.concat([df_res, df_res_up])  # DataFrame.append was removed in pandas 2.0
                
    df_res = df_res.sample(frac=1).reset_index(drop=True)       


  print('\n\nResults of Resampling: --', resampling_on, '---\n')
  distribution_df_res = df_res[label].value_counts().rename_axis(label).reset_index(name='counts')     
  print('Distribution target ressampled:\n',distribution_df_res , '\n')
  print('original data:  ',df_train.shape[0],'\nresampled data: ',  df_res.shape[0])  
                        
       
  return df_res            
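
The "DOWN" branch reduces every class to roughly the size of the smallest one. A simplified sketch of that idea on a plain list of labels, with made-up class counts; it samples exactly the minimum count per class, whereas the function above allows a small tolerance:

```python
import random
from collections import Counter


def downsample(labels, seed=0):
    # returns the row positions to keep, so that every class has the same count
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    keep = []
    for cls in counts:
        positions = [i for i, lab in enumerate(labels) if lab == cls]
        keep.extend(rng.sample(positions, target))
    return sorted(keep)


# made-up label distribution
labels = ["astro"] * 3 + ["cort"] * 5 + ["shsy5y"] * 10
kept = downsample(labels)
balanced = Counter(labels[i] for i in kept)

print("before:", Counter(labels))
print("after: ", balanced)
```

Downsampling throws data away, so with strongly imbalanced classes the class weights option further below may be the better first try.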
                

## *Training*

- get single fold (for a quick overview of the training performance and for the final training on the whole data set)

In [None]:
### train and validation set split with train df

def single_fold(train_df, label = None, test_size = 0.3):

  print("Glimpse on train df before shuffling")
  print(train_df.head())
  print("")
  print("Glimpse on train df after shuffling")
  train_df = train_df.sample(frac=1)
  print(train_df.head())
  print("\n")
    
  # parameters
  random_state = 1
  test_size    = test_size

  X = train_df.drop(label, axis = 1)
  y = train_df[label]       
        
  X_train, X_val, y_train, y_val = train_test_split(      X, 
                                                          y, 
                                                          shuffle = True,
                                                          test_size = test_size, 
                                                          random_state = random_state)
        
  X_train = X_train.reset_index(drop=True)
  X_val   = X_val.reset_index(drop=True)
  y_train = y_train.reset_index(drop=True)
  y_val   = y_val.reset_index(drop=True)
        
        
    
   
  # glimpse at train and validation sets
  print("\n")
  print(color.BOLD + "Glimpse on splitted data:" + color.END)
  print(color.BOLD + "X_train  head and tail:"  + color.END)
  print(X_train.head())
  print(X_train.tail())
  print("")
  print(color.BOLD + "y_train head and tail: " + color.END)
  print(y_train.head())
  print(y_train.tail())
   
  # check some specs
  print("\n")
  print("\n")
  print("length of train df: ", len(train_df))
  print("")
  print("length of X train:  ", len(X_train))
  print("length of X val  :  ", len(X_val))
  print("")
  print("length of X train and val  :  ", len(X_train)+len(X_val))
  print("length of y train:  ", len(y_train))
  print("length of y val:    ", len(y_val))
  print("length of y train and val  :  ", len(y_train)+len(y_val))
        
    
    # concat for data generator
  X_y_train = pd.concat([X_train, y_train], axis = 1)
  X_y_val   = pd.concat([X_val,   y_val],  axis = 1) 
               
  print("")
  print(color.BOLD + "X_y_train head and tail: "  + color.END)
  print(X_y_train.head())
  print(X_y_train.tail())
        
 
  return X_y_train, X_y_val    


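- k fold stratified (the folds are built from the unique values of the strat column, e.g. the image id, so all rows sharing a value stay in the same fold)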

In [None]:
### train and validation set split with train df and k fold split

def get_k_fold_stratified(df,  k = 5, strat = None):

    'get k stratified folds of df along strat stored in dictionary X_y_dict: X per k_fold_i'
    
    print("")
    print("Stratified after column: ", strat)
    
    # reducing df to strat
    df_strat = df[[strat]]
    df_strat = df_strat.drop_duplicates()
    df_strat = df_strat.reset_index(drop=True)
    
    # shuffling df
    print("Glimpse on df_strat before shuffling")
    print(df_strat.head())
    print("")
    print("Glimpse on df_strat after shuffling")
    df_strat = df_strat.sample(frac=1)
    print(df_strat.head())
    print("\n")
    print("Glimpse on df_strat after shuffling and resetting index")
    df_strat =  df_strat.reset_index(drop=True)
    print(df_strat.head())
    print("\n")
    
    # parameters
    len_df = len(df)
    len_df_strat = len(df_strat)
    len_k_fold = int(len_df_strat // k) 
    
    if len_k_fold == 0:
        raise Exception("k higher than len df_strat!")
    
    # creating folds
    X_y_dict = {}
    
    for i in range(k):
        key_dict = "k_fold_" + str(i+1)
        
        # list of indices; the last fold takes all remaining rows, so nothing is lost if len df_strat is not divisible by k
        range_low_i = i * len_k_fold
        range_up_i  = ((i + 1) * len_k_fold) if i < (k-1) else len_df_strat
        list_indices_i = list(range(range_low_i, range_up_i))
        
        X_strat = df_strat.iloc[list_indices_i]
        
        # merge other colums to X_strat
        X_merged = X_strat.merge(df, how = 'left', left_on = strat, right_on = strat)
        
        X_y_dict[key_dict] = X_merged
            
        print(key_dict, "length of fold_strat: ",  len(X_strat),"/", round(len(X_strat)/(len_df_strat)*100,0),"%")
        print(key_dict, "length of fold_merged: ", len(X_merged),"/", round(len(X_merged)/(len_df)*100,0),"%")
        print("")
        
    # check that the lengths of all folds add up to the length of df
    len_check = 0
    for i in range(k):
        len_check = len(X_y_dict["k_fold_" + str(i+1)]) + len_check
    if len_check != len_df:
        raise Exception("Error in splitting!")
        
    print("========================================")
    print("Length of df_merged: ", len_df)
    print("Length of folds: ",     len_check)
    
    return X_y_dict
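
The point of splitting by image id is that tiles of one image never appear in two different folds. A minimal sketch of that grouping logic in plain Python, with hypothetical image ids; `grouped_folds` is an illustrative helper and not the function defined above:

```python
import random


def grouped_folds(group_ids, k=5, seed=0):
    # shuffle the unique groups and cut them into k consecutive chunks
    unique = sorted(set(group_ids))
    random.Random(seed).shuffle(unique)
    fold_len = len(unique) // k
    if fold_len == 0:
        raise ValueError("k is larger than the number of groups")
    folds = []
    for i in range(k):
        low = i * fold_len
        # the last fold takes all remaining groups
        up = (i + 1) * fold_len if i < k - 1 else len(unique)
        chosen = set(unique[low:up])
        folds.append([row for row, g in enumerate(group_ids) if g in chosen])
    return folds


# 7 hypothetical images with 4 tiles each
group_ids = [f"img{n}" for n in range(7) for _ in range(4)]
folds = grouped_folds(group_ids, k=3)

for number, rows in enumerate(folds, start=1):
    print("fold", number, "- rows:", len(rows), "- images:", sorted({group_ids[r] for r in rows}))
```

scikit-learn offers the same kind of split as `GroupKFold`; it is not used in this notebook, but it may be a convenient replacement.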


- unstratified fold

In [None]:
### train and validation set split with train df and k fold split

def get_k_fold_unstratified(df,  k = 5):

    'get k unstratified folds of df stored in dictionary X_y_dict: X per k_fold_i'
    
    print("")
    print(color.BOLD +  "Get k Folds" + color.END + "\n")
    
    # shuffling df
    #print("Glimpse on df before shuffling")
    #print(df.head())
    #print("")
    #print("Glimpse on df after shuffling")
    df = df.sample(frac=1)
    #print(df.head())
    #print("\n")
    print("Glimpse on df after shuffling and resetting index")
    df =  df.reset_index(drop=True)
    print(df.head())
    print("\n")
    
    # parameters
    len_df = len(df)
    len_k_fold = int(len_df // k) 
    
    if len_k_fold == 0:
        raise Exception("k higher than len df!")
    
    # creating folds
    X_y_dict = {}
    
    print("Results k folding:")
    
    for i in range(k):
        key_dict = "k_fold_" + str(i+1)
        
        # list of indices; the last fold takes all remaining rows, so nothing is lost if len df is not divisible by k
        range_low_i = i * len_k_fold
        range_up_i  = ((i + 1) * len_k_fold) if i < (k-1) else len_df
        list_indices_i = list(range(range_low_i, range_up_i))
        
        X = df.iloc[list_indices_i]
        X_y_dict[key_dict] = X 
            
        print(key_dict, ": length of fold: ", len(X),"/", round(len(X)/(len_df)*100,0),"%")
        print("")
 
    # check that the lengths of all folds add up to the length of df
    len_check = 0
    for i in range(k):
        len_check = len(X_y_dict["k_fold_" + str(i+1)]) + len_check
    if len_check != len_df:
        raise Exception("Error in splitting!")
        
    print("========================================")
    print("Length of df: ", len_df)
    print("Length of folds: ", len_check)
    
    return X_y_dict    


- Image Data Generator for train and validation (for single fold training)

In [None]:
### get image data generator

def get_data_generator_train(X_y_train, X_y_val, rescale = 1./255., label = None,  color_mode = None,
                             preprocessing_function = None, shuffle = True,  batch_size = batch_size,
                             samplewise_center = False, samplewise_std_normalization = False,  horizontal_flip = False, vertical_flip = False, target_size = target_size):
        
    ###### ImageDataGenerator
   
    train_datagen = ImageDataGenerator(rescale = rescale,
                                       rotation_range = 0,
                                       width_shift_range = 0.,
                                       height_shift_range = 0.,
                                       shear_range = 0.,
                                       zoom_range = 0., 
                                       fill_mode = 'nearest',
                                       samplewise_center = samplewise_center,
                                       samplewise_std_normalization = samplewise_std_normalization ,
                                       horizontal_flip = horizontal_flip,
                                       vertical_flip = vertical_flip,
                                       preprocessing_function = preprocessing_function,
                                       dtype= 'float32')

    valid_datagen = ImageDataGenerator(rescale = rescale,
                                       samplewise_center = False,
                                       samplewise_std_normalization = False,
                                       preprocessing_function = preprocessing_function,
                                       dtype= 'float32')
    
    
    ##### flow from Dataframe
    
    train_generator = train_datagen.flow_from_dataframe(
                                        X_y_train,  
                                        x_col = 'image_path',
                                        y_col =  label,
                                        target_size = target_size, # would be resized to
                                        color_mode =  color_mode,
                                        batch_size  = batch_size,
                                        shuffle = shuffle,
                                        class_mode = "categorical")


    validation_generator = valid_datagen.flow_from_dataframe(
                                        X_y_val,  
                                        x_col = 'image_path',
                                        y_col = label,
                                        target_size = target_size, # would be resized to
                                        color_mode =  color_mode,
                                        batch_size = batch_size,
                                        shuffle = shuffle,
                                        class_mode = "categorical")
    
    return  train_generator, validation_generator
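
When the generator is called with color_mode = "rgb", the grayscale tiles are delivered with three channels, which is what the pretrained ResNet expects. As far as I understand it, the single channel is simply copied three times. A small NumPy sketch of that assumption (the resizing to the target size is left out, and the pixel values are made up):

```python
import numpy as np

# made-up 2x2 grayscale tile with 8-bit values
gray = np.array([[0, 128],
                 [255, 64]], dtype=np.uint8)

# copy the single channel three times -> shape (height, width, 3)
rgb = np.repeat(gray[..., np.newaxis], 3, axis=-1)

# same idea as rescale = 1./255. in the generator: map 0..255 to 0..1
scaled = rgb.astype(np.float32) / 255.0

print(rgb.shape)
print(scaled[..., 0])
```

All three channels carry identical information, so nothing is gained except the compatibility with the pretrained weights.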


- Class weights for imbalanced data (another option instead of the resampling method, see above)

In [None]:
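# get class weights
# note: despite its name, train_datagen must be the iterator returned by flow_from_dataframe,
# because class_indices is an attribute of that iterator and not of ImageDataGenerator
def get_class_weights(train_datagen, X_y_train, label="cell_type"):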

  class_weights={}

  dict_generator = train_datagen.class_indices

  print("class labels: ", dict_generator)

  samples = list(X_y_train[label])
  len_ = len(samples)
 
  
  for class_ in list(set(samples)):
  
    count = samples.count(class_) 
    weight = round(len_/count, 4)
    class_num = dict_generator[class_]
    class_weights[class_num] = weight 
  
  print("class weights: ", class_weights)

  return class_weights
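
The weight of a class is the total number of samples divided by the number of samples of that class, so rare classes get larger weights. A worked example with made-up counts and a hypothetical class index mapping:

```python
from collections import Counter


def class_weights_from_labels(labels, class_indices):
    # weight = total samples / samples of the class, keyed by the numeric class index
    counts = Counter(labels)
    total = len(labels)
    return {class_indices[c]: round(total / n, 4) for c, n in counts.items()}


# made-up label distribution and index mapping
labels = ["astro"] * 2 + ["cort"] * 3 + ["shsy5y"] * 5
class_indices = {"astro": 0, "cort": 1, "shsy5y": 2}

weights = class_weights_from_labels(labels, class_indices)
print(weights)
```

Only the ratio between the weights matters for balancing; their absolute size also scales the loss and therefore interacts with the learning rate.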


- get cnn model (own implementation)

In [None]:
### CNN model architecture

def define_model(input_shape, num_classes = num_classes, print_summary = True):
      

    # architecture
    model = Sequential([
                         Conv2D(8, (11, 11), padding = "SAME", input_shape=input_shape),
                         BatchNormalization(),
                         Activation('relu'),
                         AveragePooling2D(pool_size=(2, 2), strides = (2,2)),
                                          
                         Conv2D(16, (5, 5), padding = "SAME"),
                         BatchNormalization(),
                         Activation('relu'),
                         AveragePooling2D(pool_size=(2, 2), strides = (2,2)),
 
                         Conv2D(32, (3, 3), padding = "SAME"),
                         BatchNormalization(),
                         Activation('relu'),
                         AveragePooling2D(pool_size=(2, 2), strides = (2,2)),
                        
                         Conv2D(32, (3, 3), padding = "SAME"),
                         BatchNormalization(),
                         Activation('relu'),
                         AveragePooling2D(pool_size=(2, 2), strides = (2,2)),
                                                                   
                         Conv2D(32, (3, 3), padding = "SAME"),
                         BatchNormalization(),
                         Activation('relu'),
                         AveragePooling2D(pool_size=(3, 3), strides = (2,2)),
        
                         Conv2D(32, (3, 3), padding = "SAME"),
                         BatchNormalization(),
                         Activation('relu'),
                         AveragePooling2D(pool_size=(3, 3), strides = (2,2)),
           
                                          
                         Flatten(),
                         Dense(16),
                         Activation('relu'),
                         Dropout(0.2),
                         Dense(num_classes, activation = "softmax")
                        ])

    
   
    # summary
    if print_summary == True:
      model.summary()
        
    
    return model
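
To see how many values reach the first Dense layer, the spatial size can be followed through the six pooling layers. This sketch assumes an input of 224 x 224, that the convolutions keep the size (padding "SAME", stride 1) and that the pooling layers use the Keras default padding "valid":

```python
def pooled_size(size, pool, stride):
    # output length of a pooling layer without padding
    return (size - pool) // stride + 1


size = 224
# four poolings with pool 2 / stride 2, then two with pool 3 / stride 2
pooling_layers = [(2, 2)] * 4 + [(3, 2)] * 2

sizes = []
for pool, stride in pooling_layers:
    size = pooled_size(size, pool, stride)
    sizes.append(size)

flatten_units = size * size * 32            # 32 filters in the last conv layer
dense_params = flatten_units * 16 + 16      # weights + biases of Dense(16)

print("spatial sizes:", sizes)
print("flatten units:", flatten_units, "- Dense(16) parameters:", dense_params)
```

The network is therefore very small at the end, which keeps the number of parameters in the dense part low.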


- use a pretrained model

In [None]:
### load keras pretrained model

def get_pretrained_model(target_size, num_classes, print_summary = True, trainable_last_layers = 0):


  # load model
  inputs = tf.keras.Input(shape=(target_size[0], target_size[1], 3))
  resnet = tf.keras.applications.ResNet50(weights="imagenet",include_top=False,input_tensor=inputs)

  # freeze layers
  for i,layer in enumerate(resnet.layers):

    if (len(resnet.layers) - i) <= trainable_last_layers:
      layer.trainable = True
      
    else:
      layer.trainable = False   
    
  #for  i,layer in enumerate(resnet.layers):
    #print(i, layer.name,"-", layer.trainable)
      
  # add specific new layers
  model = tf.keras.models.Sequential()
  model.add(resnet)
  model.add(MaxPooling2D(pool_size=(3, 3)))
  model.add(Flatten())
  model.add(Dense(256, activation='relu'))
  model.add(Dropout(0.25))
  model.add(BatchNormalization())
  model.add(Dense(64, activation='relu'))
  model.add(Dropout(0.1))
  model.add(BatchNormalization())
  model.add(Dense(num_classes, activation='softmax'))

  # summary
  if print_summary == True:
    model.summary()
    
  return model
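
The size of the head on top of the ResNet can be checked with a bit of arithmetic. The sketch assumes a 224 x 224 input, for which ResNet50 without top reduces the spatial size by a factor of 32 and outputs 2048 channels, and the Keras default for MaxPooling2D, where strides equal the pool size:

```python
def pooled_size(size, pool, stride):
    # output length of a pooling layer without padding
    return (size - pool) // stride + 1


backbone_out = 224 // 32                    # spatial size after the ResNet50 backbone
channels = 2048                             # channels of the last ResNet50 block

pooled = pooled_size(backbone_out, 3, 3)    # MaxPooling2D(pool_size=(3, 3))
flatten_units = pooled * pooled * channels
dense_256_params = flatten_units * 256 + 256

print("backbone output:", (backbone_out, backbone_out, channels))
print("after pooling:  ", (pooled, pooled, channels))
print("flatten units:", flatten_units, "- Dense(256) parameters:", dense_256_params)
```

With a frozen backbone, most of the trainable parameters therefore sit in the first Dense layer of the head.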

- compile model

In [None]:
def compile_model(model, optimizer = "RMSprop", lr = 0.001, print_opt = True):

    # get optimizer
    if optimizer == "SGD":
      opt = tf.keras.optimizers.SGD(learning_rate = lr)
    elif optimizer == "RMSprop":
      opt = tf.keras.optimizers.RMSprop(learning_rate  = lr)
    elif optimizer == "Adam":
      opt = tf.keras.optimizers.Adam(learning_rate = lr)
    else:
      raise ValueError("optimizer must be 'SGD', 'RMSprop' or 'Adam'")
    
        
    model.compile(loss='categorical_crossentropy', optimizer = opt, metrics=["accuracy", tf.keras.metrics.AUC(name="AUC")])
    
    # summary
    if print_opt == True:
      print("Used optimizer : ", opt)
      print("Learning rate from optimizer:", K.eval(model.optimizer.learning_rate), "\n")     
        
    return model


- Fit model with generator for single fold training

In [None]:
### full model: fit with generator

def fit_model_cnn(model, train_generator, validation_generator, n_epochs = 10, class_weights = None):

            
    #checkpoint 
    checkpoint_filepath = './sartorius_cell_type_classification.h5'
    
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
                                                     filepath = checkpoint_filepath,
                                                     save_weights_only = True,
                                                     monitor = 'val_accuracy',
                                                     verbose = True,
                                                     mode = 'max',
                                                     save_best_only = True,
                                                     save_freq = 'epoch')

    # early stop (defined here, but currently not passed to the callbacks of model.fit below)
    early_stop = EarlyStopping(monitor = 'val_loss', patience = 5, verbose = True)


    CNN_history     =     model.fit(
                                    train_generator, 
                                    steps_per_epoch = int(len(X_y_train)/batch_size) ,
                                    validation_data =  validation_generator,
                                    epochs = n_epochs,
                                    class_weight = class_weights,
                                    callbacks = [checkpoint]
                                    )
    
    return CNN_history
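
The expression `int(len(X_y_train)/batch_size)` used for `steps_per_epoch` rounds down, so a final partial batch is not counted in an epoch. A minimal sketch of that arithmetic, with made-up sample counts (the numbers and both helper names are illustrative and not part of the notebook):

```python
import math

def steps_floor(n_samples, batch_size):
    # same rounding as int(len(X_y_train) / batch_size) in the notebook
    return int(n_samples / batch_size)

def steps_ceil(n_samples, batch_size):
    # alternative that also counts the last, smaller batch
    return math.ceil(n_samples / batch_size)

n_samples, batch_size = 1000, 32   # hypothetical values
print(steps_floor(n_samples, batch_size))  # 31
print(steps_ceil(n_samples, batch_size))   # 32
```

With shuffling enabled, the skipped samples differ from epoch to epoch, so rounding down is usually harmless for training.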


- k fold training

In [None]:
### k fold training: build a fresh model for every fold and collect the training history per fold

def k_fold_training(df, n_epochs = 3, k = 5, batch_size = 32, stratify = "Yes", strat = "id"):

    
    # inits of lists for storing results of metrics and loss 
    k_fold_history = {}
    
    
    # get X_y_dict with folds 
    if stratify == "No":
        print("Unstratified Sampling")
        X_y_dict = get_k_fold_unstratified(df,  k)
    else:
        print("Stratified Sampling")
        X_y_dict = get_k_fold_stratified(df,  k, strat = strat)
    
    # training loop
    for j in range(k):
        
        print("\n")
        print(color.BOLD + color.RED +  "k fold = " + str(j+1) + color.END  + "\n")
        print("============================================================")
        print("\n")
        print(color.BOLD +  "Get k Fold Concat train and validation sets for fold k = " + str(j+1) + color.END + "\n")
        
        k_fold_j = "k_fold_" + str(j+1) 
        
        X_train  = pd.DataFrame()
        X_val    = pd.DataFrame()
 
        # get training set and validation set
        print("Concatenating single folds to train and validation")
        for k_fold_i, X_i in X_y_dict.items():
            
            if k_fold_i == k_fold_j:
                X_val = X_i
            else:
                X_train = pd.concat([X_train, X_i])
                        
        print("")
        print("Data preparation for next fold for training finished!")
        print(k_fold_j,"- training : len X_train:", len(X_train), "/", round(len(X_train)/(len(X_train)+len(X_val))*100,0),"%" , " ||  len X_val:", len(X_val),
                                    "/",round(len(X_val)/(len(X_train)+len(X_val))*100,0),"%", " 1/k = ", round(1/k*100,0), "%")
        print("")
        
        # construct data generator
        print("Generator Output:")
        train_generator, validation_generator = get_data_generator_train(
                                                                          X_train,
                                                                          X_val,
                                                                          label = "cell_type",
                                                                          color_mode = "rgb",
                                                                          preprocessing_function = None,
                                                                          shuffle = True,
                                                                          batch_size = batch_size,
                                                                          samplewise_center = False,
                                                                          samplewise_std_normalization = False, 
                                                                          horizontal_flip = False,
                                                                          vertical_flip = False,
                                                                          target_size = target_size
                                                                          )
      

        # model and training
        print("")
        print(color.BOLD +  "Training for fold k = " + str(j+1)  + color.END + "\n")
        
        
        # rebuild and recompile the model so that no weights carry over between folds
        model = get_pretrained_model(target_size, num_classes, print_summary = True, trainable_last_layers = 1000)
        model = compile_model(model = model, optimizer = "Adam" , lr = 1e-5)

        history     =     model.fit(
                                      train_generator, 
                                      steps_per_epoch = int(len(X_train)/batch_size) ,
                                      validation_data =  validation_generator,
                                      epochs = n_epochs,
                                      class_weight = None
                                      )
        
       
        # storing results
        k_fold_history[k_fold_j] = history.history 
        print("\n")
    # end training loop        
    
    return k_fold_history 
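
The fold handling inside the loop can be looked at in isolation: one fold becomes the validation set, all other folds are concatenated into the training set. A minimal sketch with plain lists instead of DataFrames (`split_folds` and the toy data are hypothetical, only the `k_fold_<n>` key naming follows the notebook):

```python
def split_folds(folds, val_key):
    """Return (train, val): val is the fold named val_key,
    train is the concatenation of all remaining folds."""
    train = []
    for key, items in folds.items():
        if key != val_key:
            train.extend(items)
    return train, list(folds[val_key])

# toy folds, keyed like X_y_dict in the notebook
folds = {
    "k_fold_1": ["a", "b"],
    "k_fold_2": ["c", "d"],
    "k_fold_3": ["e", "f"],
}

for key in folds:
    train, val = split_folds(folds, key)
    print(key, "train:", train, "val:", val)
```

Every sample ends up in the validation set exactly once across the k runs.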


# Evaluation / Plotting Functions

- plot metrics for single fold

In [None]:
### loss, accuracy and auc per epoch and training and validation set 

def show_metrics_single_fold(CNN_history):
    
    #accuracy
    plt.plot(CNN_history.history['accuracy'], color = 'blue')
    plt.plot(CNN_history.history['val_accuracy'], color = 'red')
    plt.title('CNN: Accuracy vs. epochs')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Training', 'Validation'], loc='upper left')
    plt.show()
        
    #loss
    plt.plot(CNN_history.history['loss'], color='blue')
    plt.plot(CNN_history.history['val_loss'], color='red')
    plt.title('CNN: Loss vs. epochs')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Training', 'Validation'], loc='upper left')
    plt.show()
    
    try:
        #AUC
        plt.plot(CNN_history.history['AUC'], color='blue')
        plt.plot(CNN_history.history['val_AUC'], color='red')
        plt.title('CNN: AUC vs. epochs')
        plt.ylabel('AUC')
        plt.xlabel('Epoch')
        plt.legend(['Training', 'Validation'], loc='upper left')
        plt.show()
    except KeyError:
        print("AUC not set!")
    

- plot metrics for k fold

In [None]:
### loss, accuracy and auc per epoch for training and validation set, aggregated over all folds

def show_metrics_k_fold(k_fold_history):

    # inits
    fold = "k_fold_1" # dummy
    k    = len(k_fold_history.keys()) # number of folds
    max_ = 1e6
    n_epochs    = len(k_fold_history["k_fold_1"]["loss"]) # dummy
    
    
    # loop over metrics
    for metric in list(k_fold_history[fold].keys()):
        
        train_min  = list(np.ones(n_epochs) * max_)
        val_min    = list(np.ones(n_epochs) * max_)
        train_max  = list(np.zeros(n_epochs))
        val_max    = list(np.zeros(n_epochs))
        
        train_mean_array = np.zeros((k,n_epochs))
        val_mean_array  = np.zeros((k,n_epochs))
        
        if metric.find("val") < 0:
 
            plt.figure(figsize=(12,8))
            
            # loop over folds
            for j in range(k):
                
                k_fold = "k_fold_" + str(j+1)
                
                # find min max and mean for train and val along all folds
                train_min = np.minimum(k_fold_history[k_fold][metric], train_min)
                train_max = np.maximum(k_fold_history[k_fold][metric], train_max)
                
                val_min   = np.minimum(k_fold_history[k_fold]["val_" + metric], val_min)
                val_max   = np.maximum(k_fold_history[k_fold]["val_" + metric], val_max)
                
                train_mean_array[j,:]    = k_fold_history[k_fold][metric]
                val_mean_array[j,:]      = k_fold_history[k_fold]["val_" + metric]
                
                
                # plot every metric per fold
                plt.plot(k_fold_history[k_fold][metric],          color = 'red',  linestyle = ':') 
                plt.plot(k_fold_history[k_fold]["val_" + metric], color = 'deepskyblue', linestyle = ':')
                
            
            # getting mean value per metric for every epoch
            train_mean = np.mean(train_mean_array, axis = 0)
            val_mean   = np.mean(val_mean_array, axis = 0)
                  
            plt.plot(train_max,  color='crimson', label="train") 
            plt.plot(train_mean, color='darkred', linewidth=5)  
            plt.plot(train_min,  color='crimson')
            
            plt.plot(val_max,   color='dodgerblue', label = "val")
            plt.plot(val_mean,  color='darkblue',  linewidth=5)
            plt.plot(val_min,   color='dodgerblue')
            
            plt.fill_between(range(n_epochs), val_max,  val_min,    alpha=0.5, color='skyblue')
            plt.fill_between(range(n_epochs), train_max, train_min, alpha=0.5, color='pink')
            
            plt.title(f'{metric}  vs. epochs')
            plt.ylabel(f'{metric}')
            plt.xlabel('epoch')
            if metric == "loss":
                plt.legend(loc='upper right')
            else:
                plt.legend(loc='upper left')
            plt.show()
            
            # print additional results for last epoch:
            print("")
            print(color.BOLD +  "results for last epoch for", metric, ":" + color.END  )
            print("train mean: " , round(train_mean[-1],2), " - train max: ",  round(train_max[-1],2),  " - train min: ",  round(train_min[-1], 2))
            print("val mean:   " , round(val_mean[-1],2), " - val max: ",  round(val_max[-1],2),  " - val min: ",  round(val_min[-1], 2))
            print("=====================================================================================================")
            print("\n")
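
The band plotted above is the per-epoch minimum, maximum and mean of a metric across all folds. A minimal sketch of that aggregation without plotting, using only the standard library (`aggregate_folds` and the accuracy values are made up for illustration):

```python
from statistics import mean

def aggregate_folds(k_fold_history, metric):
    """Per-epoch min, max and mean of one metric over all folds."""
    curves = [fold[metric] for fold in k_fold_history.values()]
    per_epoch = list(zip(*curves))  # one tuple of fold values per epoch
    return {
        "min":  [min(values) for values in per_epoch],
        "max":  [max(values) for values in per_epoch],
        "mean": [mean(values) for values in per_epoch],
    }

# toy history in the same layout as k_fold_history
history = {
    "k_fold_1": {"val_accuracy": [0.50, 0.80, 0.90]},
    "k_fold_2": {"val_accuracy": [0.60, 0.70, 1.00]},
}

stats = aggregate_folds(history, "val_accuracy")
print(stats["min"], stats["max"])
```

A narrow band between min and max indicates that the result does not depend much on which fold was held out.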


# Training Process / Evaluation

- load data

In [None]:
get_train_semi_supervised_csv(path = "../input/sartorius-cell-instance-segmentation/train_semi_supervised", print_df = True, store_on_drive = False)

In [None]:
train_df = load_df(return_df = True) 

- get tiles

In [None]:
train_df = making_tiles(train_df, path = './', len_tile = 128, tiling = True)
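
For the numbers from the summary (images of shape (520, 704), tiles of 128 x 128, remainders discarded) the tile grid can be checked by hand. A minimal sketch of the grid arithmetic only; `tile_boxes` is a hypothetical helper, and the real `making_tiles` function defined earlier in the notebook may order or store the tiles differently:

```python
def tile_boxes(height, width, len_tile):
    """Return (top, left, bottom, right) for every full tile;
    partial tiles at the bottom and right border are dropped."""
    boxes = []
    for top in range(0, height - len_tile + 1, len_tile):
        for left in range(0, width - len_tile + 1, len_tile):
            boxes.append((top, left, top + len_tile, left + len_tile))
    return boxes

boxes = tile_boxes(520, 704, 128)
print(len(boxes))   # 20 tiles: 4 rows x 5 columns
print(boxes[-1])    # (384, 512, 512, 640)
```

So 8 pixel rows and 64 pixel columns per image are discarded, and every original image contributes 20 tiles.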

- Downsampling

In [None]:
train_df_res = resampling(train_df, resampling_on = "DOWN" , label = "cell_type")
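
A minimal sketch of what downsampling means here, assuming that `resampling_on = "DOWN"` randomly reduces every class to the size of the smallest class. The `resampling` helper itself is defined earlier in the notebook and may work differently; `downsample` and the toy rows below are hypothetical:

```python
import random
from collections import Counter, defaultdict

def downsample(rows, label, seed=0):
    """Randomly keep n_min rows per class, n_min = size of the smallest class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label]].append(row)
    n_min = min(len(members) for members in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))
    return balanced

# toy data: imbalanced class counts (3 / 5 / 8)
rows = ([{"cell_type": "astro"}] * 3
        + [{"cell_type": "cort"}] * 5
        + [{"cell_type": "shsy5y"}] * 8)

balanced = downsample(rows, "cell_type")
print(Counter(row["cell_type"] for row in balanced))
```

After this step all classes have the same number of samples, which is why `class_weight = None` is sufficient in the training calls.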

- get model and train k fold with stratified sampling (along image id)

In [None]:
k_fold_history = k_fold_training(df = train_df_res, n_epochs = 7, k = 5, batch_size = batch_size, stratify = "Yes", strat = "id")
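
The point of splitting along the image id is that all tiles of one image land in the same fold, so no image contributes tiles to both train and validation. A minimal sketch of such a group-wise assignment (`assign_folds_by_group` and the toy ids are hypothetical; the notebook's `get_k_fold_stratified` is defined earlier and may be implemented differently):

```python
import random

def assign_folds_by_group(group_ids, k, seed=0):
    """Give every group (image id) one fold index and
    return the fold index for each single tile."""
    unique = sorted(set(group_ids))
    random.Random(seed).shuffle(unique)
    fold_of_group = {g: i % k for i, g in enumerate(unique)}
    return [fold_of_group[g] for g in group_ids]

# toy data: 4 images with 4 tiles each
tile_image_ids = ["img_a"] * 4 + ["img_b"] * 4 + ["img_c"] * 4 + ["img_d"] * 4
folds = assign_folds_by_group(tile_image_ids, k=2)

for image_id in sorted(set(tile_image_ids)):
    used = {f for f, g in zip(folds, tile_image_ids) if g == image_id}
    print(image_id, "-> fold", used)
```

Each image id maps to exactly one fold, which prevents near-identical neighbouring tiles from leaking into the validation set.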

In [None]:
show_metrics_k_fold(k_fold_history)

- final training on the whole data (approximated by a very small test size)

In [None]:
X_y_train, X_y_val = single_fold(train_df_res, label = "cell_type", test_size = 0.01)

In [None]:
train_datagen, valid_datagen = get_data_generator_train(
                                                        X_y_train, 
                                                        X_y_val,
                                                        label = "cell_type",
                                                        color_mode = "rgb",
                                                        preprocessing_function = None,
                                                        shuffle = True,
                                                        batch_size = batch_size,
                                                        samplewise_center = False,
                                                        samplewise_std_normalization = False,
                                                        horizontal_flip = True,
                                                        vertical_flip = True,
                                                        target_size = target_size
                                                        )


In [None]:
CNN = get_pretrained_model(target_size, num_classes, print_summary = True, trainable_last_layers = 1000)

In [None]:
CNN = compile_model(model= CNN, optimizer = "Adam" , lr = 1e-5) 

In [None]:
fit_model_cnn(model = CNN, train_generator = train_datagen, validation_generator = valid_datagen, n_epochs = 5, class_weights = None)