## Part 2 - Analysis of Whole Slide Image Nuclei 

<img src="./images/rapids.png" alt="RAPIDS" style="width: 200px;"/>

In the previous notebook, we mentioned using the MONAI Label server to generate nuclear segmentations. This data can yield rich spatial information, along with numerous other properties, such as the shapes of all the nuclei.
There are a range of tools at our disposal when it comes to the analysis of this data and, in this section, we are going to look at a few of the tools and techniques that you can use to find new perspectives on the data you have.

We will start by using the output from running HoVerNet on the H&E stained image that comes from the [Xenium](https://www.10xgenomics.com/platforms/xenium) samples [here](https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast). You can either use the HoVerNet pipeline from MONAI to run inference on a whole slide, or use MONAI Label, which is what was done in this case. The output from the MONAI Label segmentation was an XML file with the outline of each nucleus represented as a polygon, using the annotation format created by Radboud UMC for their [ASAP](https://computationalpathologygroup.github.io/ASAP/) annotation tool.

There are many different formats for this type of data, but they are all doing a similar thing. One interesting feature of this type of data is that it is used extensively by the geospatial community. For example, if you want to create a map of a country then some sort of polygon, or set of polygons if it spans separate regions, might be used to describe the perimeter of the country.

This type of representation is useful for mapping any large 2D spaces, including tissue slides. 
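
To make the polygon representation concrete, here is a minimal pure-Python sketch. The rectangle coordinates and the helper name `ring_area` are made up for illustration; the dictionary layout mirrors the GeoJSON-like structure that is passed to shapely's `shape` function later in this notebook.

```python
# A polygon stored as a closed ring of (x, y) vertices. The first and last
# vertex are identical, which closes the ring. Coordinates are illustrative.
rectangle = {
    "type": "Polygon",
    "coordinates": [[(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)]],
}


def ring_area(ring):
    """Unsigned area of a closed ring using the shoelace formula."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


outer_ring = rectangle["coordinates"][0]
area = ring_area(outer_ring)
print("vertices:", len(outer_ring) - 1, "area:", area)
```

A country outline or a nucleus outline is stored the same way; only the number of vertices and the coordinate system differ.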

RAPIDS and other ML frameworks use tabular data rather than raw pixels, but can handle vector data using certain extensions. The tabular data is held in DataFrames. For CPU memory, Pandas is used, but RAPIDS provides a GPU accelerated Pandas equivalent, which is cuDF (CUDA Data Frame). Pandas has a special extension named GeoPandas, which encodes the spatial data in a specific format and provides high-performance analysis functions. Similarly, cuDF has cuSpatial, which is the GPU equivalent.


As usual, we start by importing the libraries that we'll need. As you will notice, there are a few new names here, such as cuDF, cuGraph and cuML. These are the core of the RAPIDS tools, offering GPU-accelerated DataFrame functionality, GPU-accelerated graph analytics and GPU-accelerated machine learning routines respectively.

You will find documentation on all of these libraries and features here: https://docs.rapids.ai/api

# Using RAPIDS to unlock clinically valuable insights from Spatial Data

In [None]:
##%load_ext cudf.pandas
from cuml import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from cuml.neighbors import NearestNeighbors
import cuxfilter
from cuxfilter import DataFrame, layouts
import pandas as pd
import cudf
from cudf import DataFrame
import cugraph
import cuml
import numpy as np
import cuspatial
import geopandas as gpd
from shapely.geometry import shape, point, polygon
import cupy
from shapely import centroid, Polygon, Point
import rmm
rmm.reinitialize(managed_memory=True)

You can start by taking a look at the .xml file that MonaiLabel emitted

In [None]:
!head -n 50 "./data/he-registered-to-dapi-shlee-patch-151_46_7311_5368.xml"

The format needed for GeoPandas to work is a little different. 

To convert the XML that we have, we need to turn each cell annotation into a shapely shape and add it to a GeoPandas dataframe, similar to the previous notebook. This is what we will do in the next cell.

In [None]:
import xml.etree.ElementTree as ET
import json

# Load the file and parse the XML
filename = "./data/he-registered-to-dapi-shlee-patch-151_46_7311_5368.xml"
tree = ET.parse(filename)
root = tree.getroot()

cells=[]
types=[]
cx=[]
cy=[]

for child in root:
    if "Description" in child.attrib:

        for annotation in child:
            #only 1 coords per annotation
            for coords in annotation:
                pgon = {}
                pgon["type"]= "Polygon"
                pgon["coordinates"] = []
                pts = []
                for coord in coords:
                     pts.append((float(coord.attrib['X']), float(coord.attrib['Y'])))
                pgon["coordinates"].append(pts)
                shp = shape(pgon)
                cells.append(shp)
                types.append(annotation.attrib["Name"])
                cx.append(centroid(shp).x)
                cy.append(centroid(shp).y)


df_slide = gpd.GeoDataFrame({'cell_type': types, 'x': cx,'y': cy, 'geometry':cells})  

df_slide

So, you can see that we now have a dataframe with roughly 20,000 cell polygons. We can then plot all of these polygons, which reveals the structure of the original tissue section.

In [None]:
df_slide.plot()

With this view, we can't see much detail, but by filtering the view we can effectively zoom in on an area. Let's start by finding all of the cells within a specific region. 
To do this we need to use some of the features that GeoPandas provides. Shapely has been optimized for these types of query too, so the performance is very good for modest numbers of objects.

In [None]:
# define a square polygon region within the coordinates
region = shape({'type': 'Polygon', 'coordinates': [[(1000.0, 1000.0), (2000.0, 1000.0), (2000.0, 2000.0), (1000.0, 2000.0), (1000.0, 1000.0)]]})

# obtain True False values for each row - if it falls within the region
df_mask = df_slide.within(region)

df_mask

Now we can use the loc method to locate all rows in which the mask is True. If we plot this out we get the requested region

In [None]:
# apply the mask to filter the rows
df_zoom = df_slide.loc[df_mask]

df_zoom.plot(column='cell_type')

Okay, so let's imagine that we need to map each cell to a known grid that contains information about the proteins or genes found at each location. This could help to, say, accurately map transcripts to specific cells and allow some analysis of the networks of cells and how they interact.

How might we go about doing this? Well, from an algorithmic point of view there are several ways of doing this, but let's start with the simplest solution and work back from there. What might a 'within polygon' function look like? It's actually not a simple algorithm, but luckily shapely has a 'contains' function that we can use.

In [None]:
geoms = []
geoms.append(Point(6, 7))
geoms.append(Point(5, 2))
geoms.append(Point(10, 8))
geoms.append(Point(7, 14))

# define a tricky polygon!
coords = [(2, 9), (7, 1), (12, 8), (7, 16), (6, 15),(10,9), (7,4), (5,7), (7,9), (7,11), (2,9)]
for p in geoms:
    print(Polygon(coords).contains(p))

# add the polygon to the list
geoms.append(Polygon(coords))    

# Create a GeoPandas dataframe with its Geometry column set to contain
# the polygons and points 
df = gpd.GeoDataFrame({'shape_type': ['xy','xy','xy','xy','poly'],'geometry':geoms})
df.plot(column='shape_type')


So, this works. 
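
For intuition about what a 'contains' test has to do, here is a minimal sketch of the classic even-odd (ray casting) rule in pure Python. This is not how shapely implements it; the function name `point_in_polygon` is hypothetical, and points lying exactly on the boundary are not handled consistently by this simple version.

```python
def point_in_polygon(x, y, ring):
    """Even-odd rule: cast a ray to the right of (x, y) and count edge crossings.

    `ring` is a closed list of (x, y) vertices (first vertex repeated at the end).
    An odd number of crossings means the point is inside.
    """
    inside = False
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        # Does this edge straddle the horizontal line through the point?
        if (y0 > y) != (y1 > y):
            # x coordinate at which the edge crosses that horizontal line
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x_cross > x:
                inside = not inside
    return inside


# The same 'tricky' polygon as in the cell above
ring = [(2, 9), (7, 1), (12, 8), (7, 16), (6, 15), (10, 9),
        (7, 4), (5, 7), (7, 9), (7, 11), (2, 9)]

test_points = [(4, 7.5), (6, 7.5), (10, 8)]
results = {p: point_in_polygon(p[0], p[1], ring) for p in test_points}
print(results)
```

The cost of this test grows with the number of polygon vertices, and it has to be repeated for every point and polygon pair, which is why spatial indexes matter once the data gets large.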

Can you extend this so that it works for every point in the 12 x 16 grid? ([Solution](./solutions/solution3.py))

Add the points and the polygon to a single GeoSeries column named 'geometry', alongside a column named 'shape_type' whose values are set to either 'poly' or 'xy'.

In [None]:
# TODO 

# ...
df.plot(column='shape_type')

To get the points within the polygon, we would then do something like this:

In [None]:
# create dataframes for the polygon and points
df1 = df[df['shape_type']=='poly']
df2 = df[df['shape_type']=='xy']

# create a join to return the points within the polygon
df2.sjoin(df1, how="inner", predicate="within").plot()

...so that seems to work okay. Let's scale this up to 1,000,000 coordinates and 20532 polygons and see how it fares...

In [None]:
# create a dataframe for the grid of points
n_points = 1000
from timeit import default_timer as timer
x = np.linspace(0, n_points-1, num=n_points)
y = np.linspace(0, n_points-1, num=n_points)
X, Y = np.meshgrid(x, y)

start=timer()
geoms = [Point(p[0], p[1]) for p in np.column_stack((X.reshape(-1), Y.reshape(-1)))]
print(timer()-start)

# Create a GeoPandas dataframe with its Geometry column set to contain
# the polygons and points 
start=timer()
types = ['xy']*(len(geoms))
df_grid = gpd.GeoDataFrame({'shape_type':types ,'geometry':geoms})
print(timer()-start)

start=timer()
d_within = df_grid.sjoin(df_slide, how="inner", predicate="within")
print(timer()-start)
d_within

In [None]:
d_within.plot()

This is still pretty good - most of the time is actually being spent creating the coordinate dataframe. Next let's see how it does when we increase the number of coordinates again.

In [None]:
# create a dataframe for the grid of points
n_points = 3000
from timeit import default_timer as timer
x = np.linspace(0, n_points-1, num=n_points)
y = np.linspace(0, n_points-1, num=n_points)
X, Y = np.meshgrid(x, y)

start=timer()
geoms = [Point(p[0], p[1]) for p in np.column_stack((X.reshape(-1), Y.reshape(-1)))]
print("Time to generate points = {}".format(timer()-start))

# Create a GeoPandas dataframe with its Geometry column set to contain
# the polygons and points 
start=timer()
types = ['xy']*(len(geoms))
df_grid = gpd.GeoDataFrame({'shape_type':types ,'geometry':geoms})
print("Time to create GeoDataFrame = {}".format(timer()-start))

start=timer()
d_within = df_grid.sjoin(df_slide, how="inner", predicate="within")
print("Time to find intersections = {}".format(timer()-start))
d_within

In [None]:
d_within.plot(marker=".")

Again, the performance is okay, but everything is starting to slow down as expected: tripling the side of the grid gives nine times as many points to test.
So, what geopandas is to pandas, cuspatial is to cudf. In other words it's a GPU accelerated version of geopandas. It uses the same underlying geoseries datastructure but is ideal for very large spatial problems that would otherwise hit a performance ceiling on CPU. Although the API is not as comprehensive as the geopandas equivalent, it still has some useful features for this sort of analysis.

You can create a cuspatial geoseries by pulling out the 'geometry' data from the dataframe. So we can create one for the grid and one for the polygons

The creation of the cuspatial data structures actually takes quite a while if they are created using Shapely, but if we create a cuDF dataframe of the x/y coordinates first, this speeds things up hugely.

In [None]:
%%time
x = [float(x) for x in range(3000) for y in range(3000)]
y = [float(y) for x in range(3000) for y in range(3000)]
xy_df = cudf.DataFrame({"x": x, "y": y}).interleave_columns()  # assuming you have a dataframe with X and Y coords
cu_points = cuspatial.GeoSeries.from_points_xy(xy_df)

Although there are faster ways of creating the polygons too (which you can experiment with if you like), this only takes a few seconds.

In [None]:
%%time
cu_boxes = cuspatial.GeoSeries(df_slide['geometry'])

Next we can use a neat feature, similar to the spatial index in geopandas, which is to create a quadtree index for the points. This makes searching through them much faster. It divides the region at each level into four quadrants and, if there are more points within a quadrant than the allowed maximum, it subdivides that quadrant recursively until the tree index is built. There are various parameters that control the speed and memory size of the tree.

The columns in the quadtree are:
* key - the index to the node
* level - how many levels down the tree this node is
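* is_internal_node - True if this node is an internal node (it has child nodes), False if it is a leaf
* length - the number of child nodes (for an internal node) or the number of points (for a leaf)
* offset - the index of the first child node (for an internal node) or of the first point (for a leaf)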
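
To make the recursive subdivision concrete, here is a small pure-Python sketch of the idea. It is a teaching aid only, not the cuSpatial implementation or its memory layout; the function name `build_quadtree` and the random test points are made up, and only the 'level', 'is_internal_node' and 'length' columns are mimicked.

```python
import random


def build_quadtree(points, x_min, x_max, y_min, y_max, max_depth, max_size, level=0):
    """Return a flat list of node dicts for the region [x_min, x_max] x [y_min, y_max].

    A node is a leaf if it holds at most `max_size` points or if `max_depth`
    has been reached; otherwise it is split into up to four child quadrants.
    """
    if len(points) <= max_size or level == max_depth:
        return [{"level": level, "is_internal_node": False, "length": len(points)}]

    x_mid = (x_min + x_max) / 2.0
    y_mid = (y_min + y_max) / 2.0
    bounds = {
        (False, False): (x_min, x_mid, y_min, y_mid),
        (True, False): (x_mid, x_max, y_min, y_mid),
        (False, True): (x_min, x_mid, y_mid, y_max),
        (True, True): (x_mid, x_max, y_mid, y_max),
    }
    quadrants = {key: [] for key in bounds}
    for x, y in points:
        quadrants[(x >= x_mid, y >= y_mid)].append((x, y))

    children = []
    n_children = 0
    for key, pts in quadrants.items():
        if pts:  # empty quadrants are not stored
            n_children += 1
            children += build_quadtree(pts, *bounds[key], max_depth, max_size, level + 1)
    # For an internal node, 'length' is the number of child quadrants
    return [{"level": level, "is_internal_node": True, "length": n_children}] + children


random.seed(0)
pts = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(500)]
nodes = build_quadtree(pts, 0, 100, 0, 100, max_depth=6, max_size=20)
leaves = [n for n in nodes if not n["is_internal_node"]]
print(len(nodes), "nodes,", len(leaves), "leaves")
```

Lowering `max_size` or raising `max_depth` produces a deeper tree with smaller leaves, which is the speed versus memory trade-off mentioned above.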

In [None]:
# Create a quadtree with scale of 1, a depth of 12 and a maximum number of points per node of 125
pi, qt = cuspatial.quadtree_on_points(cu_points, 0, 3000, 0, 3000, 1, 12, 125)

The output is an index of points and the tree itself.
To further limit the search we can generate a set of bounding boxes for each polygon (cell) and then look for intersections between points and bounding boxes. This reduces the search space by a significant factor
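
The bounding-box idea can be sketched in a few lines of pure Python. The cell outline, the grid and the helper names below are hypothetical; the point is only that a cheap rectangle test discards most points before any exact point-in-polygon test is needed.

```python
def bounding_box(ring):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a ring of vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    """Cheap rectangle test: four comparisons per point."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


# A made-up 'cell' outline and a 20 x 20 grid of candidate points
cell = [(3.0, 2.0), (8.0, 4.0), (6.0, 9.0), (2.0, 7.0), (3.0, 2.0)]
grid = [(float(x), float(y)) for x in range(20) for y in range(20)]

box = bounding_box(cell)
candidates = [p for p in grid if in_box(p, box)]
print(f"{len(candidates)} of {len(grid)} points still need the exact polygon test")
```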


In [None]:
%%time
# Get bounding boxes for all polygons
poly_bboxes = cuspatial.polygon_bounding_boxes(cu_boxes)
# Next we generate a list of these intersections using 
# the join_quadtree_and_bounding_boxes method
intersections = cuspatial.join_quadtree_and_bounding_boxes(qt, poly_bboxes, 0, 3000, 0, 3000, 1, 12)
# Find all of the points in each polygon limiting 
# the search to the intersections using the quadtree points
polygons_and_points  = cuspatial.quadtree_point_in_polygon(intersections, qt, pi, cu_points, cu_boxes)
polygons_and_points

We can then plot this out by getting the data back into GeoPandas (i.e. from GPU to CPU).

In [None]:
cu_points[pi[polygons_and_points['point_index']]].to_geopandas().plot(marker=".")

So, although the plot is not very clear, it is showing us only those points that fall within a cell.

RAPIDS has a lot of other features that we can use to examine the data we have so let's take a look at a few of these.
Next, we can create a list of centroids for each nucleus, which can allow us to do some different kinds of analysis

In [None]:
# select the columns we need; take a copy so that we can add columns to it later
df1 = df_slide[['cell_type', 'x', 'y']].copy()

# create a cuDF dataframe (GPU) from the pandas dataframe (CPU)
cdf = cudf.DataFrame(df1)
cdf

We can also display some summary statistics for the dataframe columns

In [None]:
# get some stats on the data
cdf.describe()

To get a view of the data we can use Matplotlib to show a scatter plot of the x and y coordinates of the nuclei, using the type to set the colour of each point. This gives a clear indication of the distribution of the various types of nuclei across the tissue.
First, we need to convert the cell_type into a numeric value, for which we can use the LabelEncoder class from sklearn.
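
Conceptually, label encoding just maps each distinct label to an integer. The pure-Python sketch below shows the idea with made-up cell type names and a hypothetical helper `encode_labels`; like LabelEncoder, it numbers the classes in sorted order.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, numbering classes in sorted order."""
    classes = sorted(set(labels))
    lookup = {name: code for code, name in enumerate(classes)}
    return [lookup[name] for name in labels], classes


# Made-up labels, for illustration only
cell_types = ["tumour", "stroma", "tumour", "immune", "stroma"]
codes, classes = encode_labels(cell_types)
print(classes)
print(codes)
```

Keeping the `classes` list around lets you translate the numeric codes back into names when labelling a plot legend.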

In [None]:
from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
df1['cell_type_num'] = LE.fit_transform(df1['cell_type'])
df1

In [None]:
# create a matplotlib colormap 
cmap = ListedColormap(["blue", "gold", "lawngreen", "red"])

fig, ax = plt.subplots(1, figsize = (8, 6))
plt.scatter(df1['x'], df1['y'], s = 2, c = df1['cell_type_num'], cmap = cmap)

plt.title('Nuclei coloured by type')
plt.show()

This is certainly an interesting visualisation in its own right, but it doesn't tell us much about the relationships between the cells at a more granular level.

So, what we are going to do now is to create a graph of all the nuclei that we detected and classified. 
In order to construct the graph we are going to use the nearest neighbour algorithm to find the 5 nearest neighbours to each nucleus.

To do this we will need to import the NearestNeighbors class from cuML, the machine learning library in the RAPIDS family. 

In [None]:
from cuml.neighbors import NearestNeighbors

# use only the centroid coordinates so that distances are measured in pixel space
cdf = cudf.from_pandas(df1[['x', 'y']])
knn_cuml = NearestNeighbors()
knn_cuml.fit(cdf)

%time D_cuml, I_cuml = knn_cuml.kneighbors(cdf, 5)
I_cuml, D_cuml

So, what this has produced is a list of the 5 nearest neighbours for each nucleus, using the x and y values, which give the position of the centroid of each nucleus in pixel space. The KNN (k-nearest neighbours) algorithm uses a Euclidean distance calculation (other distance metrics can be used), which tells us how close each node is to every other node. Because we chose 5 as the number of nearest neighbours, each row represents a node and has five columns containing the indexes of the 5 nearest nodes, with the nearest in column 0 and the furthest in column 4. You will also notice that the index in the nearest column, 0, always matches the row index. That's because the algorithm does not exclude each node from being its own nearest neighbour. We can ignore that column.

We are looking at the indexes in the I_cuml dataframe. To look at the physical distances, in pixels, you should look at the D_cuml dataframe.
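
To see why column 0 always refers back to the node itself, here is a brute-force nearest-neighbour sketch in pure Python. It is only meant to illustrate the shape of the two outputs on a handful of made-up centroids; the function name `brute_force_knn` is hypothetical and the approach compares every pair of points, so it does not scale.

```python
import math


def brute_force_knn(points, k):
    """Return (distances, indices) of the k nearest points for every point.

    Each point is compared against all points, itself included, so the first
    neighbour of every point is the point itself at distance 0.
    """
    distances, indices = [], []
    for p in points:
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points))
        distances.append([d for d, _ in ranked[:k]])
        indices.append([j for _, j in ranked[:k]])
    return distances, indices


# Made-up centroid coordinates in pixels
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
D, I = brute_force_knn(centroids, k=3)

for row, (idx, dist) in enumerate(zip(I, D)):
    print(row, idx, [round(d, 2) for d in dist])
```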

If you want to compare this with the sklearn CPU implementation, be aware that whilst, for this many nodes, sklearn will work fine, when you start to deal with hundreds of thousands of nodes it can take 10s of minutes to run! RAPIDS is certainly your friend in such cases. 

In order to convert the output of the KNN operation into a graph, we need to prepare the data. The data needs to be presented to the RAPIDS graph library, cugraph, as a set of edges with the source and destination node and an optional weight parameter.

Firstly we combine the nearest neighbour indexes and distances into one dataframe and give them unique column names. We do this so that we can use the distance to set the weight of the connection between the nodes

In [None]:
# give the columns names because they have to be unique in the merged dataframe
# Indexes of neighbours
I_cuml.columns=['ix1','n1','n2','n3','n4'] 
# Distance to neighbours
D_cuml.columns=['ix2','d1','d2','d3','d4'] 
# Concatenate the columns into a single dataframe
all_cols = cudf.concat([I_cuml, D_cuml],axis=1)

# remove the index and distance from the self-referenced nearest neighbour
all_cols = all_cols[['n1','n2','n3','n4','d1','d2','d3','d4']]

all_cols 

So the next step is to manipulate this data so that it is in the desired format. There should be 3 columns, named 'source', 'target' and 'distance'.

To do this, you will need to extract 4 sets of columns - one for each neighbour - and then concatenate the rows into a new dataframe.

Remember that each row index represents a node, the n1-n4 columns contain the row index of a destination node and the d1-d4 columns contain the distance between these nodes. 

In [None]:
# Reformat the data to match the way edges are defined in cuGraph
all_cols['index1'] = all_cols.index

c1 = all_cols[['index1','n1','d1']]
c1.columns=['source','target','distance']
c2 = all_cols[['index1','n2','d2']]
c2.columns=['source','target','distance']
c3 = all_cols[['index1','n3','d3']]
c3.columns=['source','target','distance']
c4 = all_cols[['index1','n4','d4']]
c4.columns=['source','target','distance']
                 
edges = [c1,c2,c3,c4]

edge_df = cudf.concat(edges)

# remove the old dataframe from memory
del(all_cols)

edge_df = edge_df.reset_index()
edge_df = edge_df[['source','target','distance']]
edge_df

Next, we need to set a maximum distance between connected nodes, so that we exclude any connections beyond a certain threshold. This may reveal groups of cells that are locally connected but separate from other 'cliques'. We will use a distance of 40 pixels to start off with but you can experiment with this setting.
Note that if the distance calculation returns the distance squared (to save many expensive sqrt operations), we need to square the threshold too. Check the values in D_cuml to see which convention applies to the version you are running.
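
The two conventions give the same result as long as the threshold matches the units of the stored distances. A minimal sketch with a made-up edge list:

```python
max_dist = 40.0  # threshold in pixels

# Hypothetical (source, target, distance in pixels) edges
edges = [(0, 1, 12.5), (0, 2, 39.9), (1, 3, 40.0), (2, 3, 57.0)]

# Case 1: the stored values are plain distances, so compare against max_dist
kept_plain = [(s, t) for s, t, d in edges if d < max_dist]

# Case 2: the stored values are squared distances, so compare against max_dist squared
edges_squared = [(s, t, d * d) for s, t, d in edges]
kept_squared = [(s, t) for s, t, d2 in edges_squared if d2 < max_dist ** 2]

print(kept_plain)
print(kept_squared)
```

Mixing the two, for example comparing squared distances against 40, would silently keep only edges shorter than about 6.3 pixels.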

In [None]:
edge_df = edge_df.loc[edge_df["distance"] < 40]
edge_df

This dataframe is now ready to be used to generate the graph. For this we use the cugraph library

In [None]:
# now we can actually create a graph!!
G = cugraph.Graph()

%time G.from_cudf_edgelist(edge_df,source='source', destination='target', edge_attr='distance', renumber=True)

Once we have the graph we can do standard graph analytical operations. Triangle Count is the number of cycles of length three. A k-core of a graph is a maximal subgraph that contains nodes of degree k or more. 
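
These two definitions can be illustrated on a tiny graph in pure Python. This is a sketch of the definitions only, not of how cuGraph computes them; the helper names and the four-node example graph are made up.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_triangles(adj):
    """Total number of triangles (cycles of length three) in the graph."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each edge once
                total += len(adj[u] & adj[v])  # common neighbours close a triangle
    return total // 3  # every triangle is found once per each of its three edges


def core_numbers(adj):
    """Core number per vertex, found by repeatedly peeling the lowest-degree vertex."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# A triangle (0, 1, 2) with one extra node 3 hanging off node 2
adj = build_adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
n_triangles = count_triangles(adj)
cores = core_numbers(adj)
print(n_triangles, cores)
```

In the tissue graph, nodes with a high core number sit in densely connected neighbourhoods, while nodes with a low core number tend to be at the edges of clusters.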

In [None]:
#now we can compute some graph metrics
count = cugraph.triangle_count(G)
print("No of triangles = " + str(count))

coreno = cugraph.core_number(G)
print("Core Number = " + str(coreno))

**Visualising the graph**

One powerful feature enabled by the RAPIDS API is the visualisation of large networks. To show this in action we are going to create a chart that displays all the nuclei centroids along with the edges between their nearest neighbours.

To do this we need two dataframes: One containing the nodes and their coordinates and the other with the edges and their source and target nodes

In [None]:
# we only need the index of the source and target nodes

# The indexes of the source and target nodes that form an edge
edge_df = edge_df[['source','target']]

# The x and y coordinates of each node (nucleus)
nodes_ = cdf[['x','y']]
# Vertex refers to the index of an item in the nodes dataframe
nodes_['vertex']=nodes_.index
nodes_.columns=['x','y','vertex']

Finally we use cuXFilter to render the whole graph!

In [None]:
cux_df = cuxfilter.DataFrame.load_graph((nodes_, edge_df))

chart0 = cuxfilter.charts.graph(edge_color_palette=['gray', 'black'],
                                            timeout=200, 
                                            node_aggregate_fn='mean', 
                                            node_color_palette=['blue'],
                                            edge_render_type='direct',
                                            edge_transparency=0.5
                                          )

d = cux_df.dashboard([chart0], layout=cuxfilter.layouts.double_feature)

# draw the graph
chart0.view()

You should be able to use the mouse wheel to zoom in and out of the graph plot. If you zoom in far enough you will see the individual nodes (coloured) and edges (grey lines).


# Exercise
Can you create a data frame that contains the cell type as a number ('cell_type_num'), the area ('area') and the perimeter length ('length') of all the cells? ([Solution](./solutions/solution4.py))

In [None]:
 # TODO - create df1 dataframe with the specified columns
df1

If that worked, then you should be able to plot them by cell type below!

In [None]:
import seaborn as sns

sns.scatterplot(
    x="area", y="length",
    hue="cell_type_num",
    palette=sns.color_palette("hls", 4),
    data=df1,
    legend="full",
    alpha=0.7
)

The result is not terribly insightful in this case, but seaborn has many types of plot that you can use, and there is also PCA and t-SNE, provided by cuML and sklearn respectively, that you can play around with.