# Training GCN
Train a Graph Neural Network with the histology + gene information.

We are going to build a brain layer classifier using [Torch Geometric's GCN](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.GCN.html?highlight=gcn#torch-geometric-nn-models-gcn).

## Loading the data

For a detailed exploration and analysis of what the data actually contains, visit the `DataAnalysis` notebook, located in the same directory as this one.

### Downloading the data

In [None]:
%%sh
./dataset/getdata.sh

### Actual loading

In [None]:
import sys, os
sys.path.append(os.path.abspath("src"))

In [None]:
%load_ext autoreload
%autoreload 2
import importlib
import preprocess

data_dir, img_dir, graph_dir = "dataset/data", "dataset/images", "out/graphs"
ann_data, histology_imgs = preprocess.main(data_dir, img_dir, graph_dir)

## Features used for training

We are going to use the following features for training:

**Edge Features**:
- `Spatial connectivities` between spots
- Pixel `distance` (adjusted by color)

**Node Features**:
- `UMI` count (log)
- `Color` in the `neighbourhood` of the spot


## PyTorch Geometric's data structure

From the official [documentation](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.GCN.html?highlight=gcn#torch-geometric-nn-models-gcn), we can see that:

A single graph in PyG is described by an instance of `torch_geometric.data.Data`, which holds the following attributes by default:

- `data.x`: Node feature matrix with shape `[num_nodes, num_node_features]`

- `data.edge_index`: Graph connectivity in `COO` format with shape `[2, num_edges]` and type `torch.long`

- `data.edge_attr`: Edge feature matrix with shape `[num_edges, num_edge_features]`

- `data.y`: Target to train against (may have arbitrary shape), e.g., node-level targets of shape `[num_nodes, *]` or graph-level targets of shape `[1, *]`

- `data.pos`: Node position matrix with shape `[num_nodes, num_dimensions]`

## Creating the required data structures for training

We need to convert the data into the structures required by PyTorch Geometric.

In [None]:
import torch

### data.edge_index
We need to transform the spatial connectivity matrix into a `PyTorch` tensor in `COO` format.
Let's start with a reference patient:

In [None]:
ann_data['151676'].obsp['spatial_connectivities']

In [None]:
type(ann_data['151676'].obsp['spatial_connectivities'])

In [None]:
coo_matrix = ann_data['151676'].obsp['spatial_connectivities'].tocoo()
coo_matrix

In [None]:
type(coo_matrix)

In [None]:
coo_connections = {patient: data.obsp['spatial_connectivities'].tocoo()
                   for patient, data in ann_data.items()}
for patient, coo in coo_connections.items():
    print(f"Patient {patient}: {coo.shape}")

In [None]:
edge_indices = {}

for patient, coo in coo_connections.items():
    row = torch.from_numpy(coo.row).long()
    col = torch.from_numpy(coo.col).long()
    edge_indices[patient] = torch.stack([row, col], dim=0)

    print(f"{patient}: {edge_indices[patient].shape}")
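
To make the conversion above concrete, here is a small sketch on a hand-made 3x3 adjacency matrix (not taken from the dataset; the `toy_` names are hypothetical). In COO format each nonzero entry is stored as a (row, column) pair, and stacking the two index arrays gives the `[2, num_edges]` layout that `edge_index` expects.

```python
import numpy as np
import scipy.sparse as sp
import torch

# Hypothetical adjacency matrix: node 0 is connected to nodes 1 and 2.
toy_adj = sp.csr_matrix(np.array([[0, 1, 1],
                                  [1, 0, 0],
                                  [1, 0, 0]], dtype=np.float32))

toy_coo = toy_adj.tocoo()  # exposes .row and .col, one entry per nonzero value

# Stack source and target indices into a [2, num_edges] tensor of type torch.long.
toy_edge_index = torch.stack([
    torch.from_numpy(toy_coo.row).long(),
    torch.from_numpy(toy_coo.col).long(),
], dim=0)

print(toy_edge_index.shape)
print(toy_edge_index)
```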

### data.edge_attr
Edge feature matrix with shape `[num_edges, num_edge_features]`.
For now, we only keep the distances between spots that are spatially connected.

In [None]:
import numpy as np
import os

edge_features = {}
for patient in ann_data.keys():
    # Load the distance matrix stored for this patient in graph_dir
    filename = f"{patient}_adj.npy"
    adj_distances = np.load(os.path.join(graph_dir, filename))
    adj_tensor = torch.from_numpy(adj_distances)
    # Keep one distance per edge, in the same order as edge_index
    row, col = edge_indices[patient]
    distances = adj_tensor[row, col]
    edge_features[patient] = distances.unsqueeze(1).float()  # [num_edges, 1]
    print(f"{patient}: {edge_features[patient].shape}")
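
The indexing step `adj_tensor[row, col]` picks one distance per edge from the matrix, in the same order as the columns of `edge_index`. A toy illustration with invented distances (the `toy_` names are hypothetical):

```python
import torch

# Hypothetical dense 3x3 distance matrix (values are invented).
toy_distances = torch.tensor([[0.0, 1.5, 4.0],
                              [1.5, 0.0, 2.5],
                              [4.0, 2.5, 0.0]])

# Only nodes 0-1 and 1-2 are connected; the pair 0-2 is deliberately left out.
toy_edge_index = torch.tensor([[0, 1, 1, 2],
                               [1, 0, 2, 1]], dtype=torch.long)

toy_row, toy_col = toy_edge_index
# Advanced indexing returns one value per (toy_row[i], toy_col[i]) pair.
toy_edge_attr = toy_distances[toy_row, toy_col].unsqueeze(1).float()  # [num_edges, 1]

print(toy_edge_attr.shape)
print(toy_edge_attr.squeeze(1))
```

The distance between the unconnected pair never appears in the result, which is what restricts the edge features to spatially connected spots.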

### data.x
Node feature matrix with shape `[num_nodes, num_node_features]`.

In [None]:
import scanpy as sc

node_features = {}
for patient, data in ann_data.items():
    # Both calls modify `data` in place, so run this cell only once per load
    sc.pp.normalize_total(data)
    sc.pp.log1p(data)

    # Densify the expression matrix: [num_nodes, num_node_features]
    node_features[patient] = data.X.todense()
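
For intuition about what the two `scanpy` calls do, the sketch below reproduces them with plain NumPy on an invented count matrix. It assumes the default settings, where `normalize_total` rescales every spot to the median total count across spots and `log1p` applies `log(1 + x)`. It is an illustration only, not a replacement for the `scanpy` functions.

```python
import numpy as np

# Invented UMI counts: 3 spots (rows) x 4 genes (columns).
toy_counts = np.array([[10, 0, 5, 5],
                       [1, 1, 1, 1],
                       [0, 8, 0, 2]], dtype=np.float64)

totals = toy_counts.sum(axis=1, keepdims=True)  # total counts per spot: 20, 4, 10
target = np.median(totals)                      # assumed default target: median total

normalized = toy_counts / totals * target       # every row now sums to `target`
log_normalized = np.log1p(normalized)           # log(1 + x), zeros stay at zero

print(normalized.sum(axis=1))
print(log_normalized.round(3))
```

After normalization, spots with very different sequencing depth become comparable, and the log transform compresses the range of highly expressed genes.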
