# COLLABORATION WELCOME!
***If the ideas, code, features, or predictions below are interesting or helpful to you, please reach out! Given how different this approach is from the common Python + Deep Learning approach, there might be an opportunity for us to team up and ensemble our models. I'm releasing my code and ideas publicly here, but if you use significant parts of this to make your solution better, please consider inviting me to your team ^_^***

## CHANGELOG

### Update 2021.03.27

I've overhauled the notebook to be aimed more at an actual submission.
- It can run offline, thanks to the downloads being moved to another notebook.
- I've implemented some hefty feature engineering, including spatially-aware features, in Julia.
- I've moved the modeling over to LightGBM in Python.
- I show off my RLE encoding/decoding skills to move segmentation masks from Julia to Python (I use numba to JIT compile Python code to get a 100x speedup for reading RLE masks)

## Introduction

This notebook is a bit atypical for a Kaggle image competition. First, it's written mostly in Julia rather than Python (only the final modeling step uses Python). Second, it uses no deep learning. If this sounds interesting to you, please continue reading!

### Goals

The primary goal of this notebook is to take a crack at the competition from scratch (segmentation and prediction) with an atypical set of methods and a different programming language and package ecosystem. The secondary goal is to try to run everything fast on a CPU.

### Backstory

I'm new to the Julia programming language, but I'm very excited about it. I got interested in this competition from a friend just after wrapping up my first project in Julia, and I thought to myself, "hey, let's take a crack at doing this in Julia, just as a learning exercise." Since it's clunky actually developing Julia in Kaggle notebooks (no autocomplete, can't interrupt long-running computation, etc.), I started working locally on my laptop, which means no GPU. It's harder (but possible!) to write Julia code for GPU than CPU, anyhow, and as I'm just starting to figure things out, I figured I'd try to make my approach efficient enough to run on a laptop CPU. Thus the goals above came about naturally.

## Installing Julia
Before we can run any code, we have to install the Julia language. Sadly, even though Julia is the "Ju" in **Ju**pyter, Kaggle no longer supports Julia Kernels directly, so we have to use Julia through a Python kernel, which is a bit painful.

In order to make this notebook work offline, I separated most of the installation steps into [another notebook, which you can take a look at if you're curious](https://www.kaggle.com/lukemerrick/julia-download).

In [None]:
!cd ../input/julia-download/install_julia && sh install.sh

In [None]:
!pip install --no-index --find-links ../input/pycocotools/ pycocotools

In [None]:
import os
from multiprocessing import cpu_count
os.environ["JULIA_NUM_THREADS"] = str(cpu_count())
import julia
from julia.api import Julia
# cannot use precompiled packages with pyjulia on linux :-(
#     but we can use a system image with PyCall pre-compiled to speed that up
jl = Julia(
    sysimage="../input/julia-download/install_julia/python_julia_sysimage.so",
    compiled_modules=False
)
%load_ext julia.magic

In [None]:
%%julia
# confirm we're using more than one thread
using Base.Threads: nthreads
nthreads()

## "Installing" home-baked dependencies
In addition to making a semi-polished public package, I also created modules to organize utility functions. Unfortunately this buries the logic for segmentation a bit. I'm planning to come back and run the lower-level functions and better explain the segmentation details, but for now, I'm working on getting things going end-to-end.

In [None]:
!mkdir ProteinAtlas

In [None]:
%%writefile ProteinAtlas/ProteinAtlas.jl
module ProteinAtlas

import
    CellSegmentation,
    FileIO,
    ImageContrastAdjustment,
    ImageCore,
    ImageSegmentation

using ColorTypes: Gray, HSV, N0f8, RGB
using CellSegmentation: CellImage, CellImageNoAxes, cell_image, nearest_neighbor_resize
using ImageTransformations: imresize
using Statistics: quantile, median, mean

export
    HPA_CHANNELS,
    load_example,
    rgb_hpa_image,
    HPASegmentation,
    segment_cells,
    SegmentedCell,
    segmentation_image,
    extract_cell,
    resize_cell


const HPA_CHANNELS = [:mt, :er, :nu, :pr]

"""
Loads all four channels of an example as Grayscale images.
* red microtubule "mt" channel
* yellow endoplasmic reticulum "er" channel
* blue nucleus "nu" channel
* green protein of interest "pr" channel

Then offers a StackedView of them together.
"""
function load_example(dir::String, key::String)::CellImage
    mt = FileIO.load(joinpath(dir, "$(key)_red.png"))
    er = FileIO.load(joinpath(dir, "$(key)_yellow.png"))
    nu = FileIO.load(joinpath(dir, "$(key)_blue.png"))
    pr = FileIO.load(joinpath(dir, "$(key)_green.png"))
    example_data = Array{Gray{N0f8},3}(
        undef, (size(mt)..., 4)
    )
    example_data[:, :, 1] = mt
    example_data[:, :, 2] = er
    example_data[:, :, 3] = nu
    example_data[:, :, 4] = pr
    return cell_image(example_data, HPA_CHANNELS)
end

"""
Convert a HPAImage to RGB for visualization.
    * red for the microtubule channel
    * yellow for the endoplasmic reticulum channel
    * blue for the nucleus channel
    * green for the protein channel

Options
-------
`include_er`: Toggles the yellow er channel, so you can have true RGB of mt, pr, nu
`match_intensity_histograms`: Scales the brightness of each channel to match the average
    across channels. Generally makes the image more boring, but also easier to
    see locations of all three channels (no one channel should dominate).
`gamma_adjust`: Gamma adjustment. Dropping to between 0.5 and 0.8 can brighten
    dark stuff. Especially useful when match_intensity_histograms=true.
`brightness_scale`: Linearly scale pixel values up by this amount
    (e.g. 1.5 is a 50% increase in brightness)
"""
function rgb_hpa_image(
    image::Union{CellImage,CellImageNoAxes};
    include_er::Bool=true,
    match_intensity_histograms::Bool=false,
    gamma_adjust::Real=1.0,
    brightness_scale::Real=1.0
)::Array{RGB{N0f8},2}
    # restore channel axes if missing
    if isa(image, CellImageNoAxes)
        image = cell_image(image, HPA_CHANNELS)
    end

    # drop :er channel if we're skipping it
    if !include_er
        image = image[channel=[:mt, :pr, :nu]]
    end

    # optionally rescale each channel so its color histogram matches that of
    # the average across all channels
    if match_intensity_histograms
        image = deepcopy(image)
        channel_avg = mean(image; dims=3)[:, :, 1]
        adjustment = ImageContrastAdjustment.Matching(; targetimg=channel_avg)
        for channel_slice in eachslice(image; dims=3)
            ImageContrastAdjustment.adjust_histogram!(channel_slice, adjustment)
        end
    end

    # red, green, and blue are channels directly
    red = image[channel=:mt]
    green = image[channel=:pr]
    blue = image[channel=:nu]

    # we treat yellow as an even mix of red and green by adding to both channels
    if include_er
        red += image[channel=:er] / 2
        green += image[channel=:er] / 2
        ImageCore.clamp01!(red)
        ImageCore.clamp01!(green)
    end
    result = RGB{N0f8}.(red, green, blue)
    gamma_correction = ImageContrastAdjustment.GammaCorrection(gamma=gamma_adjust)
    ImageContrastAdjustment.adjust_histogram!(result, gamma_correction)
    result = RGB{N0f8}.(ImageCore.clamp01.(brightness_scale .* result))
    return result
end

struct HPASegmentation
    nuclei_segmentation_map::Array{Int64,2}
    cell_segmentation_map::Array{Int64,2}
end

"""
Segments nuclei and uses that to segment the whole cells.

Algorithm used is marker-based watershed.

Options
-------
`resize_px`: Internally the image is downscaled to resize_px x resize_px for performance.
    Smaller is faster, but too small may give poor results. 
`nucleus_bg_arguments`: Arguments passed to `get_channel_background` for computing
    nucleus background.
`cell_bg_arguments`: Arguments passed to `get_channel_background` for computing
    whole cell background.
`nucleus_marker_quantile`: Quantile of nonzero distance values to use in
    identifying nucleus segmentation markers. Higher will result in
    smaller/fewer markers. Too high and small/noisy nuclei will be missed, too
    low and close nuclei may merge.
"""
function segment_cells(
    image::CellImage;
    resize_px::Int64=512,
    nucleus_bg_arguments::Dict{Symbol,Float64}=Dict(
        :median_filter_window_size => 0.005,
        :background_nonzero_quantile => 0.2,
        :area_opening_window_size => 0.02
    ),
    cell_bg_arguments::Dict{Symbol,Float64}=Dict(
        :median_filter_window_size => 0.01,
        :background_nonzero_quantile => 0.2,
        :area_opening_window_size => 0.02
    ),
    nucleus_marker_quantile::Float64=0.4,
)::HPASegmentation
    # downscale image 
    small_image = cell_image(imresize(image, (resize_px, resize_px)), HPA_CHANNELS)
    # segment nuclei using marker-based watershed
    nucleus_background = CellSegmentation.get_background_mask(
        small_image[channel=:nu].data; nucleus_bg_arguments...
    )
    nucleus_seg = CellSegmentation.segment_via_background(
        nucleus_background; watershed_marker_quantile=nucleus_marker_quantile
    )
    nucleus_seg_map = ImageSegmentation.labels_map(nucleus_seg)

    # segment full cells
    cell_grayscale = (
        small_image[channel=:nu].data
        + small_image[channel=:er].data
        + small_image[channel=:mt].data
    )
    cell_background = CellSegmentation.get_background_mask(
        cell_grayscale; cell_bg_arguments...
    )
    cell_seg = CellSegmentation.segment_via_background(
        cell_background; markers=nucleus_seg_map
    )
    cell_seg_map = ImageSegmentation.labels_map(cell_seg)

    # upscale segmentation maps
    full_size = size(image)[1:2]
    nucleus_seg_map = Int.(nearest_neighbor_resize(nucleus_seg_map, full_size))
    cell_seg_map = Int.(nearest_neighbor_resize(cell_seg_map, full_size))
    return HPASegmentation(nucleus_seg_map, cell_seg_map)
end

# create an evenly spaced color palette using HSV across various saturation levels
n_angles = 8
n_saturations = 3
colors = [
    HSV(θ + (i % 2) * 360/n_angles/2, 0.5 + i / n_saturations / 2, 0.7)
    for θ in 360/n_angles:360/n_angles:360, i in n_saturations:-1:1
][:]
pushfirst!(colors, HSV(0, 0, 0))

# define function to max out brightness (the HSV value) of nuclei pixels,
# keeping hue and saturation unchanged
max_brightness(color::HSV)::HSV = HSV(color.h, color.s, 1)

"""
Create HSV image of a segmentation map
"""
function segmentation_image(seg::HPASegmentation)::AbstractMatrix{HSV}
    seg_image = map(i -> colors[i % length(colors) + 1], seg.cell_segmentation_map)
    nucleus_mask = seg.nuclei_segmentation_map .> 0
    seg_image[nucleus_mask] = max_brightness.(seg_image[nucleus_mask])
    return seg_image
end

struct SegmentedCell
    image::CellImage
    nucleus_mask::BitMatrix
    cell_mask::BitMatrix
end

"""
Extract the contents of a segment mask and place it on a black background.

Also returns masks of the pixels in the nucleus and of the pixels in the whole cell.
"""
function extract_cell(
    nucleus_mask::AbstractArray,
    cell_mask::AbstractArray,
    full_image::CellImage
)::SegmentedCell
    nucleus_indices = findall(nucleus_mask)
    cell_indices = findall(cell_mask)
    idx_min = minimum(cell_indices)
    idx_max = maximum(cell_indices)
    shift = idx_min - CartesianIndex(1, 1)
    shifted_cell_indices = cell_indices .- (shift,)
    shifted_nucleus_indices = nucleus_indices .- (shift,)
    max_dim = maximum((idx_max - idx_min).I) + 1
    channel_names = full_image.axes[3].val
    extracted_image = zeros(Gray{N0f8}, (max_dim, max_dim, length(channel_names)))
    extracted_image = cell_image(extracted_image, channel_names)
    extracted_image[shifted_cell_indices, :] = full_image[cell_indices, :]
    out_nucleus_mask = falses(max_dim, max_dim)
    out_cell_mask = falses(max_dim, max_dim)
    out_nucleus_mask[shifted_nucleus_indices] .= 1
    out_cell_mask[shifted_cell_indices] .= 1
    return SegmentedCell(extracted_image, out_nucleus_mask, out_cell_mask)
end

function resize_cell(cell::SegmentedCell, new_size)
    return SegmentedCell(
        CellSegmentation.cell_image(imresize(cell.image, new_size), HPA_CHANNELS),
        Bool.(nearest_neighbor_resize(cell.nucleus_mask, new_size)),
        Bool.(nearest_neighbor_resize(cell.cell_mask, new_size))
    )
end

end

# TODO: Walk through individual components
In this latest round of updates, I'm pushing to get my work running end-to-end (including submission of predictions). In future updates, I plan to come back and show step-by-step how the segmentation and feature engineering work.

# Putting it all together (at least the Julia part)
Below is a script that handles the loading, segmentation, and feature-encoding of all the cell images. As output, we get feature vectors for all the cells identified by segmenting all the train and test images, as well as JSON files which contain RLE-encoded segmentation masks for all the test cells.
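
The mask hand-off between the two languages can be sketched in a few lines. This is a minimal, pure-Python illustration, not the code the notebook actually runs: the helper names `rle_encode_column_major` and `rle_decode_column_major` are hypothetical, and the only things taken from the notebook are the JSON field names (`mask_nrow`, `mask_ncol`, `run_values`, `run_lengths`) and the column-major scan order, which mirrors `rle(mask[:])` in Julia and `order="F"` on the Python side.

```python
from itertools import groupby


def rle_encode_column_major(mask):
    """Run-length encode a 2-D 0/1 mask, scanning down each column first."""
    n_row, n_col = len(mask), len(mask[0])
    # Column-major flattening: all rows of column 0, then column 1, and so on.
    flat = [mask[r][c] for c in range(n_col) for r in range(n_row)]
    run_values, run_lengths = [], []
    for value, run in groupby(flat):
        run_values.append(int(value))
        run_lengths.append(sum(1 for _ in run))
    return {
        "mask_nrow": n_row,
        "mask_ncol": n_col,
        "run_values": run_values,
        "run_lengths": run_lengths,
    }


def rle_decode_column_major(mask_nrow, mask_ncol, run_values, run_lengths):
    """Invert rle_encode_column_major, returning a list of rows."""
    flat = [v for v, n in zip(run_values, run_lengths) for _ in range(n)]
    return [
        [flat[c * mask_nrow + r] for c in range(mask_ncol)]
        for r in range(mask_nrow)
    ]


mask = [
    [0, 1, 1],
    [0, 1, 0],
]
encoded = rle_encode_column_major(mask)
decoded = rle_decode_column_major(**encoded)
```

Because the decoder takes the same keyword names as the JSON fields, a parsed mask dictionary can be splatted straight into it with `**`.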

In [None]:
%%time
%%julia
@info "Running Julia imports. Compiling can take a few minutes, so be patient."
import
    CellSegmentation,
    CSV,
    FileIO,
    ImageMorphology,
    ImageSegmentation,
    JSON,
    Parquet,
    Random

using Base.Threads: @threads
using CellSegmentation: nearest_neighbor_resize, CellImage
using ColorTypes: RGB, N0f8, HSV, Gray
using DataFrames
using ImageCore: clamp01
using ImageTransformations: imresize
using MosaicViews: mosaicview
using ProgressMeter: Progress, next!, @showprogress
using Statistics: quantile, std, mean, cor
using StatsBase: counts, rle

# local modules
push!(LOAD_PATH, "./ProteinAtlas")
using ProteinAtlas

In [None]:
%%time
%%julia
const N_TRAINING_IMG = 20
@info "Starting main process! JIT compiling can take a long time, so be patient."
@info "Subsetting to $N_TRAINING_IMG images."

const TRAIN_DIR = "../input/hpa-single-cell-image-classification/train"
const TEST_DIR = "../input/hpa-single-cell-image-classification/test"
const TRAIN_CSV = "../input/hpa-single-cell-image-classification/train.csv"

struct RLEMask
    mask_nrow::Integer
    mask_ncol::Integer
    run_values::Vector{<:Real}
    run_lengths::Vector{<:Integer}
    function RLEMask(mask::BitMatrix)
        mask_nrow, mask_ncol = size(mask)
        run_values, run_lengths = rle(mask[:])
        return new(mask_nrow, mask_ncol, UInt8.(run_values), run_lengths)
    end
end


function sample_rows(df; n=10, seed=0)
    rng = Random.MersenneTwister(seed)
    n = min(n, nrow(df))
    indices = Random.randperm(rng, nrow(df))[1:n]  # pass rng so that `seed` takes effect
    return df[indices, :]
end

"""
Split up a blob mask (nucleus or whole cell) into rings and pie slices
"""
function region_segment(mask::AbstractMatrix; n_ring=4, n_angle=8)
    height, width = size(mask)
    bg_mask = .!mask
    distance_from_bg = ImageMorphology.distance_transform(
        ImageMorphology.feature_transform(bg_mask)
    )
    lo, hi = extrema(distance_from_bg[:])
    slice_width = (hi - lo) / n_ring
    ring_masks = falses(height, width, n_ring)
    for i in 1:n_ring
        ring_masks[:, :, i] = (
            lo + slice_width * (i - 1) .< distance_from_bg .<= lo + slice_width * i
        )
    end
    
    # angle slices
    _, center = findmax(distance_from_bg)
    center_r, center_c = center.I
    offset_min = CartesianIndex(1 - center_r, 1 - center_c)
    offset_max = CartesianIndex(height - center_r, width - center_c)
    polar_angles = [atan(x.I...) for x in offset_min:offset_max]
    slice_angle = 2 * π / n_angle
    slice_masks = falses(height, width, n_angle)
    for i in 1:n_angle
        slice = @. ((i - 1) * slice_angle - π <= polar_angles < i * slice_angle - π)
        slice_masks[:, :, i] = slice .& mask
    end
    return ring_masks, slice_masks
end


"""
Segment the bright "speckles" in the protein channel
"""
function segment_bright_protein_shapes(protein_channel; bright_quantile=0.98)
    nonzero_pr = protein_channel .> 0
    threshold = any(nonzero_pr) ? quantile(protein_channel[nonzero_pr], bright_quantile) : 0.0
    protein_mask = protein_channel .> threshold
    seg = ImageMorphology.label_components(protein_mask)
    return seg
end


"""
Extract a fixed-length vector of numerical summary statistics for a region
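    - min, max, mean, std, and quantiles (5%, 25%, 50%, 75%, 95%) of each color channel
    - pairwise correlation between color channels
    - fraction of all protein speckles in the region (raw and per pixel)
    - protein speckle size distribution: min, max, mean, std, and the same quantiles
    - protein speckles per pixel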
"""
function summarize_region(
    image::CellImage,
    region_mask::AbstractMatrix,
    protein_speckles::AbstractMatrix;
    aggregations=(minimum, maximum, mean, std),
    quantiles=(0.05, 0.25, 0.5, 0.75, 0.95)
)::Vector{Float32}
    result = Float32[]
    region_pixels = Float32.(image[region_mask, :])
    region_pixels = reshape(region_pixels, size(region_pixels)[1:2])  # squeeze
    sort!(region_pixels; dims=1)
    segment_speckles = protein_speckles[region_mask]

    # univariate summaries of pixel information
    for agg_fn in aggregations
        append!(result, agg_fn(region_pixels; dims=1)[:])
    end

    # quantiles
    for channel_pixels in eachslice(region_pixels; dims=2)
        append!(result, quantile(channel_pixels, quantiles; sorted=true))
    end

    # bi-channel correlations
    pixel_correlations = cor(region_pixels; dims=1)[
        [CartesianIndex(i, j)
        for i in Base.OneTo(size(region_pixels, 2))
            for j in Base.OneTo(i - 1)]
    ]
    append!(result, pixel_correlations)

    # protein speckle information
    speckle_sizes = counts(segment_speckles)[2:end]  # drop background (which is value 0)
    speckle_sizes = speckle_sizes[speckle_sizes .!= 0]  # drop empty
    if length(speckle_sizes) == 0
        speckle_frac = 0
        speckle_frac_per_pixel = 0
        speckle_aggregations = zeros(length(aggregations))
        speckle_quantiles = zeros(length(quantiles))
        speckles_per_pixel = 0
    else
        n_total_speckles = maximum(protein_speckles)
        n_pixels_in_region = length(segment_speckles)
        # fraction of total speckles in this region
        speckle_frac = length(speckle_sizes) / n_total_speckles
        speckle_frac_per_pixel = speckle_frac / n_pixels_in_region
        speckle_aggregations = [f(speckle_sizes) for f in aggregations]
        speckle_quantiles = quantile(speckle_sizes, quantiles)
        speckles_per_pixel = length(speckle_sizes) / n_pixels_in_region
    end
    push!(result, speckle_frac)
    push!(result, speckle_frac_per_pixel)
    append!(result, speckle_aggregations)
    append!(result, speckle_quantiles)
    push!(result, speckles_per_pixel)
    replace!(result, NaN => 0)  # in-place; plain `replace` would discard its result
    return result
end

fill_na_div(a, b) = replace!(a ./ b, NaN => 1, Inf => 10, -Inf => -10)


"""
Extracts a fixed-length vector of numerical summary statistics for a cell_image
    - region-level statistics for cell, nucleus, and nonnucleus
    - (max - min) range of region-level statistics across ring and slice regions of the nucleus
    - (max - min) range of region-level statistics across ring and slice regions of the cell
    - ratios of region-level statistics and ranges 
"""
function summarize_cell(cell::SegmentedCell)
    features = Float32[]
    protein_speckles = segment_bright_protein_shapes(cell.image[channel=:pr])
    cell_summary = summarize_region(cell.image, cell.cell_mask, protein_speckles)
    nucleus_summary = summarize_region(cell.image, cell.nucleus_mask, protein_speckles)
    nonnucleus_mask = cell.cell_mask .& .!cell.nucleus_mask
    nonnucleus_summary = (
        any(nonnucleus_mask) ?
        summarize_region(cell.image, nonnucleus_mask, protein_speckles) :
        cell_summary  # copy the cell/nucleus summary if no nonnucleus region exists
    )
    nucleus_vs_nonnucleus = fill_na_div(nucleus_summary, nonnucleus_summary)
    append!(features, cell_summary)
    append!(features, nucleus_summary)
    append!(features, nonnucleus_summary)
    append!(features, nucleus_vs_nonnucleus)
    for ((ring_masks, slice_masks), summary) in (
        (region_segment(cell.cell_mask), cell_summary),
        (region_segment(cell.nucleus_mask), nucleus_summary),
    )
        ring_summaries = cat(
            (
                summarize_region(cell.image, region_mask, protein_speckles)
                for region_mask in eachslice(ring_masks, dims=3)
                if sum(region_mask) > 0
            )...;
            dims=2
        )
        slice_summaries = cat(
            (
                summarize_region(cell.image, region_mask, protein_speckles)
                for region_mask in eachslice(slice_masks, dims=3)
                if sum(region_mask) > 0  # happens when cell on edge of image
            )...;
            dims=2
        )
        ring_range = [max - min for (min, max) in extrema(ring_summaries; dims=2)]
        slice_range = [max - min for (min, max) in extrema(slice_summaries; dims=2)]
        normalized_ring_range = fill_na_div(ring_range, summary)
        normalized_slice_range = fill_na_div(slice_range, summary)
        ring_vs_slice = fill_na_div(ring_range, slice_range)
        append!(features, normalized_ring_range)
        append!(features, normalized_slice_range)
        append!(features, ring_vs_slice)
    end
    return features
end

function load_and_featurize(dir, key)
    feature_vectors = Vector{Float32}[]
    rle_masks = RLEMask[]
    image = load_example(dir, key)
    seg = segment_cells(image; resize_px=400)
    n_cells = maximum(seg.cell_segmentation_map)
    for cell_number in 1:(n_cells - 1)
        nucleus_mask = seg.nuclei_segmentation_map .== cell_number
        cell_mask = seg.cell_segmentation_map .== cell_number
        cell = extract_cell(nucleus_mask, cell_mask, image)
        cell = resize_cell(cell, min(size(cell.nucleus_mask), resize_size))
        try
            push!(feature_vectors, summarize_cell(cell))
            push!(rle_masks, RLEMask(cell_mask))  # only push the RLE mask if featurization succeeds
        catch
            @error "Error featurizing cell $cell_number of image $key"
        end
    end
    feature_matrix = nothing
    if length(feature_vectors) > 0
        feature_matrix = mapreduce(transpose, vcat, feature_vectors)
    else
        @error "No feature vectors for image $key"
    end
    return feature_matrix, rle_masks
end


#####
##### Key-label data loading and sampling
#####

# load train.csv and parse/multihot-encode labels resulting in columns :id, :0, :1, ...
@info "Loading train.csv"
train = CSV.read(TRAIN_CSV, DataFrame)
parse_labels(column) = [parse(Int, x) for x in split(column, "|")]
select!(train, :ID => :id, :Label => ByRow(parse_labels) => :label)
sample_train = (
    N_TRAINING_IMG === nothing ?
    train :
    sample_rows(train, n=N_TRAINING_IMG)

)


#####
##### Training image loading/segmentation/featurization
#####
@info "[Train] Loading, segmenting, and featurizing cells from $(nrow(sample_train)) images"
resize_size = (200, 200)
progress = Progress(nrow(sample_train); desc="Images...")
res = Vector(undef, nrow(sample_train))
enumerated_itr = collect(enumerate(sample_train[!, :id]))

# precompile trigger
load_and_featurize(TRAIN_DIR, enumerated_itr[1][2])

# parallel processing of load/segment/extract/featurize
@threads for (i, key) in enumerated_itr
    feature_matrix, rle_masks = load_and_featurize(TRAIN_DIR, key)
    res[i] = (key, feature_matrix)
    next!(progress)
end

# remove empty results
@info "Removing empty results for images with no cells found"
filter!(r -> r[2] !== nothing, res)

# create dataframe of features
@info "Concatenating results into feature table"
feature_df_list = DataFrame[]
for (key, feature_matrix) in res
    df = DataFrame(feature_matrix)
    insertcols!(df, 1, :key => key)
    push!(feature_df_list, df)
end
feature_df = vcat(feature_df_list...)

@info "Writing features to parquet"
Parquet.write_parquet("train_features.parquet", feature_df)


#####
##### Test image loading/segmentation/featurization
#####
test_image_keys = unique([
    String(split(x, "_")[1])
    for x in readdir(TEST_DIR)
    if endswith(x, ".png")
])
if N_TRAINING_IMG !== nothing
    test_image_keys = test_image_keys[1:N_TRAINING_IMG]
end

@info "[Test] Loading, segmenting, and featurizing cells from $(length(test_image_keys)) images"
progress = Progress(length(test_image_keys); desc="Images...")
res = Vector(undef, length(test_image_keys))
enumerated_itr = collect(enumerate(test_image_keys))

# parallel processing of load/segment/extract/featurize
@threads for (i, key) in enumerated_itr
    feature_matrix, rle_masks = load_and_featurize(TEST_DIR, key)
    res[i] = (key, feature_matrix, rle_masks)
    next!(progress)
end

# remove empty results
@info "Removing empty results for images with no cells found"
filter!(r -> r[2] !== nothing, res)

# create dataframe of features and giant list of masks
@info "Concatenating results into feature table and RLE-encoded mask list"
feature_df_list = DataFrame[]
rle_mask_list = RLEMask[]
for (key, feature_matrix, rle_masks) in res
    df = DataFrame(feature_matrix)
    insertcols!(df, 1, :key => key)
    push!(feature_df_list, df)
    append!(rle_mask_list, rle_masks)
end
feature_df = vcat(feature_df_list...)

@info "Writing features to parquet"
Parquet.write_parquet("test_features.parquet", feature_df)

@info "Writing rle masks to JSON"
# JSON requires the entire file to be read at once, so we need to chunk it
#  if we want the reader to be able to read some at a time
out_dir = "rle_masks"
rm(out_dir; force=true, recursive=true)  # recursive is needed if a previous run left files behind
mkpath(out_dir)
masks_per_file = 5_000
for (file_i, mask_i) in enumerate(1:masks_per_file:length(rle_mask_list))
    end_i = min(mask_i + masks_per_file - 1, length(rle_mask_list))
    write(joinpath(out_dir, "$file_i.json"), JSON.json(rle_mask_list[mask_i:end_i]))
end

println("Done!")

## Modeling in Python

I've decided to switch over to Python for the modeling section here at the end.

Why? I have now completed Julia code to segment the images and compute spatially-aware statistics that encode each cell as a fixed-length numeric feature vector, and the last step of fitting a model is relatively small, after all. However, I have started losing steam on this project, and doing the remainder in Julia would take much more work than just using Python.

For one, I'm familiar with several popular ML libraries in Python, but I'm not yet familiar with any of the Julia ML ecosystem. In a previous version of this notebook, I just used a Newton-Raphson logistic regression solver I wrote from scratch in Julia. Unfortunately, with my more serious feature engineering code I now have over 500 features per cell, about 30x as many features as before. The Hessian inversion in the full Newton-Raphson solver could cost $30^3 = 27,000$ times the compute, so it doesn't seem feasible to continue using that solver.
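
To make that scaling concrete, here is the textbook Newton-Raphson update for logistic regression. The notation ($X$ for the $n \times d$ feature matrix, $y$ for the labels, $p$ for the predicted probabilities) is mine and not taken from the earlier solver, and the cost estimate assumes a dense, direct solve:

$$
\beta^{(t+1)} = \beta^{(t)} - H^{-1} g, \qquad g = X^\top (p - y), \qquad H = X^\top W X, \qquad W = \operatorname{diag}\big(p_i (1 - p_i)\big)
$$

The Hessian $H$ is $d \times d$. Forming it costs $O(n d^2)$ and solving the linear system costs $O(d^3)$, so multiplying $d$ by 30 multiplies the first term by $30^2 = 900$ and the second by $30^3 = 27,000$.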

Additionally, it seems pretty nonsensical to try and re-implement in Julia the complicated segmentation mask encoding function needed for submission. Instead, it's much easier to export a more standard RLE encoding of the masks into JSON from Julia, and then to use Python to re-encode via the function provided by the organizers and the `pycocotools` library.
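
The last two steps of that re-encoding (zlib compression followed by base64) can be checked on their own with just the standard library. This is only a sketch: `compress_to_ascii` and `ascii_to_raw` are hypothetical helper names, and the byte string below is a placeholder standing in for the `counts` bytes that `pycocotools` would produce.

```python
import base64
import zlib


def compress_to_ascii(raw: bytes) -> str:
    """Compress bytes with zlib, then base64-encode them into an ASCII string."""
    compressed = zlib.compress(raw, zlib.Z_BEST_COMPRESSION)
    return base64.b64encode(compressed).decode()


def ascii_to_raw(text: str) -> bytes:
    """Invert compress_to_ascii, which is handy for sanity-checking a submission string."""
    return zlib.decompress(base64.b64decode(text))


# Placeholder bytes; in the real pipeline these come from the COCO RLE "counts" field.
placeholder_counts = b"0102030405" * 10
ascii_mask = compress_to_ascii(placeholder_counts)
recovered = ascii_to_raw(ascii_mask)
```

Round-tripping a few masks this way is a cheap check that nothing was mangled before the strings go into the submission file.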


The Python code below basically goes through the typical GBM baseline approach to non-computer-vision Kaggle competitions. We load the features into DataFrames, feed that into LightGBM, and boom, we get predictions. There is a little nuance in that we train one model per class independently (a naive approach that ignores class interactions).
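
Training one model per class means each class needs its own binary target column. As a rough, dependency-free sketch of that idea (the function name `multihot_encode` and the toy label strings are made up for illustration; the notebook itself does this step with pandas):

```python
def multihot_encode(label_strings, sep="|"):
    """Turn strings like "0|2" into one boolean target list per class."""
    parsed = [sorted({int(x) for x in s.split(sep)}) for s in label_strings]
    classes = sorted({c for labels in parsed for c in labels})
    # One binary target per class: True where that class appears in the row's labels.
    targets = {c: [c in labels for labels in parsed] for c in classes}
    return classes, targets


classes, targets = multihot_encode(["0|2", "2", "1|2"])
```

Each entry of `targets` can then serve as the `y` for an independent binary classifier, which is exactly why interactions between classes are ignored.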

In [None]:
import base64
import itertools
import json
import multiprocessing
import zlib
from functools import reduce
from pathlib import Path

import lightgbm as lgb
import numpy as np
import pandas as pd
from numba import njit
from pycocotools import _mask as coco_mask
from sklearn.metrics import average_precision_score
from sklearn.model_selection import KFold
from tqdm import tqdm

In [None]:
# load and align the training data
input_df = pd.read_parquet("train_features.parquet").set_index("key")
target_df = pd.read_csv("../input/hpa-single-cell-image-classification/train.csv")
expanded_labels = target_df["Label"].str.split("|", expand=True).astype("float")
target_classes = pd.Series(reduce(np.union1d, expanded_labels.apply(pd.unique))).dropna().astype("int")
multihot_labels = pd.DataFrame(
    {
        tc: expanded_labels.eq(tc).any(axis=1)
        for tc in target_classes
    }
)
target_df = pd.concat([target_df["ID"].rename("key"), multihot_labels], axis=1).set_index("key")
target_df = target_df.loc[input_df.index]
print("Input data excerpt")
display(input_df.iloc[:3, :20].T)
print("Target data excerpt")
display(target_df.head(3).T)

In [None]:
# load test data
test_input_df = pd.read_parquet("test_features.parquet").set_index("key")
print("Test input excerpt")
display(test_input_df.iloc[:3, :20].T)

In [None]:
# load and re-encode masks

def rle_to_mask(mask_nrow, mask_ncol, run_values, run_lengths):
    """NOTE: even with itertools, this is super slow and takes ~1sec per mask."""
    return np.array(tuple(itertools.chain.from_iterable(
        itertools.repeat(v, r)
        for v, r in zip(run_values, run_lengths)
    ))).reshape((mask_nrow, mask_ncol), order="F")


@njit
def _rle_to_mask_1d(length, run_values, run_lengths):
    """Python loops are slow, but LLVM-compiled with numba is fast.
    (~100x faster than `rle_to_mask`)
    """
    res = np.empty(length, dtype=np.uint8)
    i = 0
    for v, r in zip(run_values, run_lengths):
        for _ in range(r):
            res[i] = v
            i += 1
    return res


def fast_rle_to_mask(mask_nrow, mask_ncol, run_values, run_lengths):
    # note: it's faster to convert the lists to numpy arrays
    #     rather than having the numba-jitted code take in Python lists
    length_1d = mask_nrow * mask_ncol
    res = _rle_to_mask_1d(length_1d, np.array(run_values), np.array(run_lengths))
    return res.reshape((mask_nrow, mask_ncol), order="F")


def binary_mask_to_ascii(mask):
    # convert input mask to expected COCO API input --
    mask_to_encode = mask.reshape(mask.shape[0], mask.shape[1], 1)
    mask_to_encode = mask_to_encode.astype(np.uint8)
    mask_to_encode = np.asfortranarray(mask_to_encode)

    # RLE encode mask --
    encoded_mask = coco_mask.encode(mask_to_encode)[0]["counts"]

    # compress and base64 encoding --
    binary_str = zlib.compress(encoded_mask, zlib.Z_BEST_COMPRESSION)
    base64_str = base64.b64encode(binary_str)
    return base64_str.decode()


def reencode_mask(mask_dict):
    return binary_mask_to_ascii(fast_rle_to_mask(**mask_dict))


competition_mask_strings = []
image_width_heights = []
with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
    for mask_file_path in sorted(Path("rle_masks/").iterdir()):
        print(f"Parsing mask file {mask_file_path}")
        with mask_file_path.open() as f:
            rle_mask_json = json.load(f)
        reencoded_masks = pool.imap(reencode_mask, tqdm(rle_mask_json))
        competition_mask_strings.extend(reencoded_masks)
        image_width_heights.extend([(mask["mask_ncol"], mask["mask_nrow"]) for mask in rle_mask_json])

In [None]:
# Train one model per class

# # big model for accuracy
# model_params = dict(
#     n_estimators=1_000,
#     learning_rate=0.02,
#     num_leaves=127,
#     colsample_bytree=0.1,
#     subsample=0.9,
#     subsample_freq=1,
# )

# small model for testing
model_params = dict(
    n_estimators=30,
    learning_rate=0.3,
    num_leaves=7,
    colsample_bytree=0.1,
    subsample=0.9,
    subsample_freq=1,
)

# train models on each class
X = input_df.to_numpy()
models = []
for i, target_class in enumerate(tqdm(target_df.columns)):
    y = target_df[target_class].to_numpy()
    model = lgb.LGBMClassifier(**model_params)
    model.fit(X, y)
    models.append(model)

In [None]:
# run predictions
X = test_input_df.to_numpy()
pred_df = pd.DataFrame(
    {
        target_class: model.predict_proba(X)[:, 1]
        for target_class, model in zip(target_df.columns, models)
    },
    index=test_input_df.index
)
print("Prediction excerpt")
pred_df.head(3).T

In [None]:
# create submission

# let's choose thresholds so that the rate we guess a particular label is
# a fixed multiple of the class frequency
thresholds = {}
rate_multiple = 2
base_rates = target_df.mean()
threshold_rates = rate_multiple * base_rates
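for target_class, predictions in pred_df.items():
    # predict a label for the top `rate` fraction of cells, so the cutoff
    # sits at the (1 - rate) quantile of the predicted probabilities
    rate = min(1, threshold_rates[target_class])
    thresholds[target_class] = predictions.quantile(1 - rate)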
display(pd.DataFrame(dict(
    base_rates=base_rates, threshold_rates=threshold_rates, threshold_values=thresholds
)))

In [None]:
submission_rows = []
for mask_string, (width, height), pred_tuple in zip(
    tqdm(competition_mask_strings),
    image_width_heights,
    pred_df.itertuples(index=True, name=None)
):
    key, *preds = pred_tuple
    for target_class, pred in zip(target_df.columns, preds):
        if pred > thresholds[target_class]:
            prediction_string = f"{target_class} {pred:.20f} {mask_string}"
            row = {
                "ID": key,
                "ImageWidth": width,
                "ImageHeight": height,
                "PredictionString": prediction_string,
            }
            submission_rows.append(row)
submission_df = pd.DataFrame.from_records(submission_rows)
submission_df.to_csv("submission.csv", index=False)
submission_df.head()

## Conclusion

So here we have it, no pre-training, no GPU, all thanks to Julia.

I hope this work is inspiring to the Julia community. I've been a career-long Python programmer, and this is only my second Julia project (my first being a [library to download and parse stock data from IEX](https://github.com/lukemerrick/InvestorsExchange.jl)). Already I feel that I can try out so many things that just aren't feasible in Python/numpy, like non-ML segmentation algorithms, hand-implemented logistic regression code, and custom visualizations of images. To any more veteran members of the Julia community who are reading through: I would deeply appreciate any feedback and constructive criticism on my code. I'm still a beginner trying to learn, and every tip helps!

I also hope this work is inspiring to folks who aren't afraid to think creatively in the face of cookie-cutter deep learning model frameworks. Certainly I doubt my segmentation code gives results as clean as those from the official [HPA Cell Segmentation](https://github.com/CellProfiling/HPA-Cell-Segmentation) project, but it gives solid results and runs quite efficiently (under 1 second to segment a 400x400-resized image on a single core of a laptop CPU; most of the runtime actually goes to higher-resolution feature computation). Not to devalue all the creativity and endless tweaking that goes into making a competitive deep learning submission, but I hope more folks out there will try more "way out there" approaches like this.


### Opportunities for future work

Collaboration welcome!

- The first thing I want to do is go back and explain what all the Julia code does, with lots of images showing the intermediate steps of segmentation, spatial feature engineering, etc.

- When it comes to improving my submission, I think there are some fairly simple things to try adding to the GBM modeling.
    - It would be good to take into account the correlation between classes and the fact that when multiple classes are present together, the features look different than when the classes are present individually. One thing to try would be finding the top target interactions (i.e. pairs of classes that show up together) and training a model that predicts the presence/absence of each interaction.
    - Filtering the training data would be good, too, since a lot of cells in an image may not actually match the label for the whole image. One simple approach would be to get a set of cross-validated predictions using the full data, and then drop the labels from cells which are predicted with significantly lower confidence than the other cells in the same image.

- A more conventional next step would be to get some autograd and convolutions going, but in Julia rather than Python. [Flux.jl](https://github.com/FluxML/Flux.jl) seems like a fun time, but also like another big chunk of work to dive into, so I'll have to see.

- There's also the interesting technical challenge of seeing how much of my Julia code I can port over to run on GPU. It would probably require re-implementing the underlying Watershed algorithm from [ImageSegmentation.jl](https://juliaimages.org/v0.20/imagesegmentation/), which seems both daunting and interesting. It seems [possible](https://github.com/louismullie/watershed-cuda).
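
As a rough sketch of the two GBM ideas listed above (interaction targets and label filtering), the following uses only NumPy. The function names, the `ratio` cutoff, and the toy arrays are all assumptions made for illustration, not code from this notebook. In practice the label matrix would come from something like `target_df.to_numpy()`, and the out-of-fold predictions from a cross-validation loop over the per-class models.

```python
import numpy as np


def top_label_pairs(labels, k=3):
    """Return the k most frequent co-occurring class pairs as (i, j, count)."""
    labels = np.asarray(labels, dtype=np.int64)
    # entry (i, j) counts the rows where classes i and j are both present
    cooccurrence = labels.T @ labels
    # upper triangle without the diagonal, so each pair appears once (i < j)
    i_idx, j_idx = np.triu_indices(labels.shape[1], k=1)
    counts = cooccurrence[i_idx, j_idx]
    order = np.argsort(-counts, kind="stable")[:k]
    return [(int(i_idx[n]), int(j_idx[n]), int(counts[n])) for n in order]


def interaction_target(labels, i, j):
    """Binary target that is 1 only where classes i and j are both present."""
    labels = np.asarray(labels)
    return (labels[:, i] * labels[:, j]).astype(np.int64)


def drop_low_confidence_labels(y, oof_pred, image_ids, ratio=0.5):
    """Zero out positive labels on cells scored far below their image's median.

    `ratio` is a hypothetical cutoff: a positive cell keeps its label only if
    its out-of-fold prediction is at least `ratio` times the median prediction
    of the positive cells in the same image.
    """
    y = np.asarray(y).copy()
    oof_pred = np.asarray(oof_pred, dtype=float)
    image_ids = np.asarray(image_ids)
    for image_id in np.unique(image_ids):
        positive = (image_ids == image_id) & (y == 1)
        if not positive.any():
            continue
        cutoff = ratio * np.median(oof_pred[positive])
        y[positive & (oof_pred < cutoff)] = 0
    return y


# toy data: 4 cells, 3 classes
toy_labels = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 1]])
pairs = top_label_pairs(toy_labels, k=2)
pair_target = interaction_target(toy_labels, pairs[0][0], pairs[0][1])

# toy data: 5 cells from two images, image-level labels copied to each cell
filtered = drop_low_confidence_labels(
    y=[1, 1, 1, 1, 0],
    oof_pred=[0.9, 0.8, 0.1, 0.7, 0.2],
    image_ids=["a", "a", "a", "b", "b"],
)
```

Each `pair_target` could then be fit with its own `LGBMClassifier`, the same way the per-class models are trained, and `filtered` would replace the original label column before refitting. Comparing against the per-image median is only one way to define "significantly lower confidence"; a rank-based or z-score rule may work as well or better.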