# PointCloud to Graph Protein Lifting Tutorial

***
This notebook shows how to import UniProt protein data and convert it to a graph using the `PointCloudToGraph` class. Each protein is represented as a point cloud in which every point is a residue, positioned at its alpha carbon (Cα). The graph is created by connecting residues that are close to each other in 3D space or adjacent in the sequence.

The target is the mass of each protein.

The notebook is divided into sections:

- [Loading the dataset](#loading-the-dataset) loads the config files for the data and the desired transformation, creates a dataset object and visualizes it.
- [Loading and applying the lifting](#loading-and-applying-the-lifting) defines the edges in two ways:
    - **Sequential**: connects residues that follow one another in the sequence. This reflects the peptide bonds that link the amino acids of a protein chain in a specific order.
    - **KNN**: connects residues that are close to each other in 3D space. This reflects the physical proximity of the residues in the protein structure.
- [Create and run a cell NN model](#create-and-run-a-cell-nn-model) runs a forward pass of a graph neural network to check that everything works as expected.

***

Note that for simplicity the notebook is set up to use a point cloud.

With this submission, the **UniProt** protein dataset becomes available. It is loaded as a point cloud built from PDB files.
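
As a rough illustration of how a PDB file can be reduced to a point cloud of alpha carbons, the sketch below reads the fixed-width `ATOM` records and keeps only the `CA` atoms. It is not the loader's actual code: `parse_ca_atoms` and `make_atom_line` are hypothetical helpers, and the coordinates are illustrative values, not data from this dataset.

```python
def make_atom_line(serial, atom, res, res_seq, x, y, z):
    """Hypothetical helper: format one ATOM record in PDB fixed-width layout."""
    return (
        f"ATOM  {serial:>5} {atom:^4} {res:>3} A{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_ca_atoms(pdb_text):
    """Return a list of (residue_name, (x, y, z)) for every C-alpha atom."""
    points = []
    for line in pdb_text.splitlines():
        # PDB columns (1-indexed): atom name 13-16, residue name 18-20,
        # x 31-38, y 39-46, z 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue = line[17:20].strip()
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            points.append((residue, xyz))
    return points


# Two residues with three backbone atoms each (illustrative coordinates).
sample_pdb = "\n".join(
    [
        make_atom_line(1, "N", "MET", 1, 27.340, 24.430, 2.614),
        make_atom_line(2, "CA", "MET", 1, 26.266, 25.413, 2.842),
        make_atom_line(3, "C", "MET", 1, 26.913, 26.639, 3.531),
        make_atom_line(4, "N", "GLN", 2, 26.335, 27.770, 3.258),
        make_atom_line(5, "CA", "GLN", 2, 26.850, 29.021, 3.898),
    ]
)

point_cloud = parse_ca_atoms(sample_pdb)
for residue, xyz in point_cloud:
    print(residue, xyz)
```

Only one point per residue survives, which is what makes the number of vertices in the lifted graph equal to the number of residues.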
***

### Imports and utilities

In [16]:
import sys

sys.path.append("../..")

In [17]:
# With this cell any imported module is reloaded before each cell execution
%load_ext autoreload
%autoreload 2
from modules.data.load.loaders import PointCloudLoader
from modules.data.preprocess.preprocessor import PreProcessor
from modules.utils.utils import (
    describe_data,
    load_dataset_config,
    load_model_config,
    load_transform_config,
)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Loading the dataset

Here we just need to specify the name of the available dataset that we want to load. First, the dataset config is read from the corresponding yaml file (located in the `/configs/datasets/` directory), and then the data is loaded via the implemented `Loaders`.


In [18]:
dataset_name = "UniProt"
dataset_config = load_dataset_config(dataset_name)
loader = PointCloudLoader(dataset_config)


Dataset configuration for UniProt:

{'data_domain': 'pointcloud',
 'data_type': 'UniProt',
 'data_name': 'UniProt',
 'data_dir': 'datasets/pointcloud/UniProt',
 'query': 'length:[95 TO 155]',
 'format': 'tsv',
 'fields': 'accession,length',
 'size': 20,
 'num_features': 20,
 'num_classes': 1,
 'task': 'regression',
 'loss_type': 'mse',
 'monitor_metric': 'mae',
 'task_level': 'graph'}
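
The `query`, `format`, `fields` and `size` entries above look like parameters of a UniProt search request. The sketch below shows how such a configuration could be turned into a request URL. It rests on assumptions: the endpoint is UniProt's public REST search for UniProtKB, and `build_search_url` is a hypothetical helper, not a function of this repository. The loader's real request logic is not shown in this notebook and may differ. No network call is made.

```python
from urllib.parse import urlencode

# Assumed endpoint: UniProt's public REST search over UniProtKB entries.
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(config):
    """Hypothetical helper: build a search URL from the dataset config."""
    params = {key: config[key] for key in ("query", "format", "fields", "size")}
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


config = {
    "query": "length:[95 TO 155]",  # proteins with 95 to 155 residues
    "format": "tsv",
    "fields": "accession,length",
    "size": 20,  # number of entries requested
}

url = build_search_url(config)
print(url)
```

Under this reading, `size: 20` would explain why the dataset contains 20 proteins, and the length range keeps the graphs small.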


We can then access the data through the `load()` method:

In [20]:
dataset = loader.load()

PDB file for O14960 already exists.
PDB file for O14907 already exists.
PDB file for O14519 already exists.
PDB file for O60519 already exists.
PDB file for O75379 already exists.
PDB file for A6NNB3 already exists.
PDB file for O60814 already exists.
PDB file for C9JLW8 already exists.
PDB file for O43914 already exists.
PDB file for A2RU14 already exists.
PDB file for O75956 already exists.
PDB file for O15116 already exists.
PDB file for O14933 already exists.
PDB file for O00453 already exists.
PDB file for A6NFY7 already exists.
PDB file for O00422 already exists.
PDB file for O15540 already exists.
PDB file for O15511 already exists.
PDB file for O95139 already exists.
PDB file for A8MQ03 already exists.

## Loading and Applying the Lifting

In this section we instantiate the lifting we want to apply to the data. For this example the KNN lifting was chosen. The algorithm connects each node to its k nearest neighbors in 3D space (`k_value` in the configuration below) and also adds an edge between each pair of sequentially adjacent residues.
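
The two edge rules described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the `PointCloudKNNLifting` implementation: `knn_sequential_edges` is a hypothetical function, edges are treated as undirected, and self-loops are excluded (matching `loop: False` in the configuration).

```python
import math


def knn_sequential_edges(coords, k):
    """Return sorted undirected edges joining each point to its k nearest
    neighbors and to its successor in sequence order."""
    n = len(coords)
    edges = set()
    for i in range(n):
        # Rank all other points by Euclidean distance to point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    # Sequential edges: residue i is bonded to residue i + 1.
    for i in range(n - 1):
        edges.add((i, i + 1))
    return sorted(edges)


# Four residues folded so that the last one ends up near the first.
coords = [
    (0.0, 0.0, 0.0),
    (3.0, 0.0, 0.0),
    (3.0, 2.0, 0.0),
    (0.5, 0.5, 0.0),
]

edges = knn_sequential_edges(coords, k=1)
print(edges)
```

In this toy chain the edge between residues 0 and 3 comes only from spatial proximity, while the edges (0, 1) and (2, 3) come only from sequence order, which is why the two rules are combined.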


In [21]:
# Define transformation type and id
transform_type = "liftings"
# If the transform is a topological lifting, it should include both the type of the lifting and the identifier
transform_id = "pointcloud2graph/knn_lifting"

# Read yaml file
transform_config = {
    "lifting": load_transform_config(transform_type, transform_id)
    # other transforms (e.g. data manipulations, feature liftings) can be added here
}


Transform configuration for pointcloud2graph/knn_lifting:

{'transform_type': 'lifting',
 'transform_name': 'PointCloudKNNLifting',
 'max_cell_length': None,
 'preserve_edge_attr': False,
 'feature_lifting': 'ProjectionSum',
 'k_value': 10,
 'loop': False}


We then apply the transform via our `PreProcessor`:

In [24]:
lifted_dataset = PreProcessor(dataset, transform_config, loader.data_dir)
describe_data(lifted_dataset)

Transform parameters are the same, using existing data_dir: /home/bmiquel/Documents/Projects/Topo/challenge-icml-2024/datasets/pointcloud/UniProt/UniProt/lifting/1540663474

Dataset contains 20 samples.

Providing more details about sample 0/20:
 - Graph with 151 vertices and 1730 edges.
 - Features dimensions: [20, 0]
 - There are 0 isolated nodes.



## Create and Run a Cell NN Model

In this section a simple model is created to test that the lifting works as intended. Since the lifted data is a graph, a graph neural network (GraphSAGE) from `torch_geometric` is used.

In [25]:
from modules.models.graph.graphsage import GraphSAGEModel

model_type = "graph"
model_id = "graphsage"
model_config = load_model_config(model_type, model_id)

model = GraphSAGEModel(model_config, dataset_config)


Model configuration for graph GRAPHSAGE:

{'in_channels_0': None,
 'in_channels_1': None,
 'in_channels_2': None,
 'hidden_channels': 32,
 'out_channels': None,
 'n_layers': 2}


In [26]:
y_hat = model(lifted_dataset.get(0))

  return torch.nn.functional.softmax(global_mean_pool(z, None))


If everything is correct, the cell above should execute without errors.