# Milieu

Milieu is a disease protein discovery algorithm based on the hypothesis that proteins associated with the same disease share mutual interactors in the protein-protein interaction network.   
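To make the mutual-interactor hypothesis concrete, here is a minimal sketch on a toy network. The protein names, the adjacency map, and both helper functions are hypothetical; the sketch only counts shared neighbors and is not the learned scoring that Milieu itself uses.

```python
# Toy undirected PPI network as an adjacency map (hypothetical protein names).
toy_ppi = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"X", "Z"},
    "D": {"W"},
    "X": {"A", "B", "C"},
    "Y": {"A", "B"},
    "Z": {"B", "C"},
    "W": {"D"},
}


def mutual_interactors(network, p, q):
    """Return the proteins that interact with both p and q."""
    return network[p] & network[q]


def mutual_interactor_score(network, candidate, seeds):
    """Sum, over the seed proteins, the number of interactors shared with the candidate."""
    return sum(
        len(mutual_interactors(network, candidate, seed))
        for seed in seeds
        if seed != candidate
    )


# "A" and "B" play the role of known disease proteins; "C" and "D" are candidates.
seeds = ["A", "B"]
scores = {c: mutual_interactor_score(toy_ppi, c, seeds) for c in ["C", "D"]}
print(scores)  # {'C': 3, 'D': 0}
```

Candidate `C` shares interactors with both seeds, while `D` shares none, so under the hypothesis `C` is the more plausible disease protein.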

In [1]:
%load_ext autoreload
%autoreload 2

import os

import networkx as nx

from milieu.data.network import PPINetwork
from milieu.data.associations import load_diseases
from milieu.milieu import MilieuDataset, Milieu
from milieu.paper.figures.network_vis import show_network

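# Change to the project directory so that the relative `data/...` paths below resolve.
# Adjust the path for your own machine.
#os.chdir("/dfs/scratch0/sabri/milieu")
os.chdir("/Users/sabrieyuboglu/Documents/sabri/research/projects/milieu/milieu")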

## Load the PPI Network

We use the protein-protein interaction network compiled by Menche *et al.* [1]. The network consists of 342,353 interactions between 21,557 proteins.
In `data/networks`, you can find this network as `bio-pathways-network.txt`. See Methods for a more detailed description of the network.
You can also find two other protein-protein interaction networks, `string-network.txt` and `bio-grid-network.txt`. See Supplementary Note 3 for a detailed description.

In [2]:
network = PPINetwork("data/networks/bio-pathways-network.txt")

## Build the *Milieu* Model

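The model is configured with a `params` dictionary that sets the device, batch size, number of epochs, optimizer, and evaluation metrics.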

In [3]:
params = {
    "cuda": False,
    "device": 2,
    
    "batch_size": 200,
    "num_workers": 4,
    "num_epochs": 10,
    
    "optim_class": "Adam",
    "optim_args": {
        "lr": 0.01,
        "weight_decay": 0.0
    },
    
    "metric_configs": [
        {
            "name": "recall_at_25",
            "fn": "batch_recall_at", 
            "args": {"k":25}
        }
    ]
}
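The `recall_at_25` metric configured above reports recall among the 25 highest-ranked proteins. As a rough illustration of what such a metric measures, the sketch below uses a hypothetical `recall_at_k` helper on toy scores; it is not the library's `batch_recall_at` implementation, whose exact behavior is not shown here.

```python
def recall_at_k(scores, positives, k=25):
    """Fraction of the positives that appear among the k highest-scoring items."""
    if not positives:
        return 0.0
    # Rank item identifiers from highest to lowest score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & set(positives)) / len(positives)


# Toy example with hypothetical protein identifiers and scores.
toy_scores = {"P1": 0.9, "P2": 0.8, "P3": 0.4, "P4": 0.3, "P5": 0.1}
toy_recall = recall_at_k(toy_scores, positives={"P1", "P4"}, k=2)
print(toy_recall)  # 0.5
```

Only `P1` of the two positives lands in the top 2, so the recall is 0.5.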

In [4]:
milieu = Milieu(network, params)

Milieu
Setting parameters...
Building model...
Building optimizer...
Done.


## Train the Model
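*Milieu* is trained on a large set of known disease-protein associations. Here we load the associations in `data/go_associations/go_function/associations.csv`, train on the first 90%, and hold out the remaining 10% for validation.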

In [5]:
diseases = list(load_diseases("data/go_associations/go_function/associations.csv", exclude_splits=["none"]).values())
train_diseases = diseases[:int(len(diseases)* 0.9)]
valid_diseases = diseases[int(len(diseases)* 0.9):]
train_dataset = MilieuDataset(network, diseases=train_diseases)
valid_dataset = MilieuDataset(network, diseases=valid_diseases)
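The split above takes the first 90% of the associations in the order they were loaded. If that order is not random (which is not known here), a seeded shuffle before slicing may give a more representative validation set. The sketch below shows one way to do this; `split_train_valid` and the toy identifiers are hypothetical and are not part of the `milieu` package.

```python
import random


def split_train_valid(items, train_fraction=0.9, seed=0):
    """Shuffle a copy of items with a fixed seed, then slice into train and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global random state is untouched
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-ins for the loaded association objects.
toy_items = [f"GO:{i:07d}" for i in range(20)]
train_items, valid_items = split_train_valid(toy_items)
print(len(train_items), len(valid_items))  # 18 2
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing validation metrics between experiments.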

In [6]:
milieu.train_model(train_dataset, valid_dataset)

Starting training for 10 epoch(s)
Epoch 1 of 10
Training


100%|██████████| 3/3 [00:07<00:00,  2.46s/it, loss=1.373]

Validation



100%|██████████| 1/1 [00:00<00:00,  1.38it/s]

Epoch 2 of 10
Training



100%|██████████| 3/3 [00:07<00:00,  2.47s/it, loss=1.361]

Validation



100%|██████████| 1/1 [00:00<00:00,  1.31it/s]

Epoch 3 of 10
Training



100%|██████████| 3/3 [00:07<00:00,  2.46s/it, loss=1.353]

Validation



100%|██████████| 1/1 [00:00<00:00,  1.37it/s]

Epoch 4 of 10
Training



100%|██████████| 3/3 [00:07<00:00,  2.45s/it, loss=1.350]

Validation



100%|██████████| 1/1 [00:00<00:00,  1.37it/s]

Epoch 5 of 10
Training



100%|██████████| 3/3 [00:07<00:00,  2.45s/it, loss=1.340]

Validation



100%|██████████| 1/1 [00:00<00:00,  1.15it/s]

Epoch 6 of 10
Training



100%|██████████| 3/3 [00:07<00:00,  2.64s/it, loss=1.337]

Validation



100%|██████████| 1/1 [00:00<00:00,  1.31it/s]

Epoch 7 of 10
Training



100%|██████████| 3/3 [00:07<00:00,  2.59s/it, loss=1.337]

Validation



100%|██████████| 1/1 [00:00<00:00,  1.34it/s]

Epoch 8 of 10
Training



100%|██████████| 3/3 [00:07<00:00,  2.60s/it, loss=1.330]

Validation



100%|██████████| 1/1 [00:00<00:00,  1.33it/s]

Epoch 9 of 10
Training



100%|██████████| 3/3 [00:07<00:00,  2.53s/it, loss=1.327]

Validation



100%|██████████| 1/1 [00:00<00:00,  1.36it/s]

Epoch 10 of 10
Training



100%|██████████| 3/3 [00:07<00:00,  2.54s/it, loss=1.323]

Validation



100%|██████████| 1/1 [00:00<00:00,  1.26it/s]


([{'recall_at_25': 0.18099820483749055},
  {'recall_at_25': 0.17019895259691176},
  {'recall_at_25': 0.19315635460533417},
  {'recall_at_25': 0.15077692743764173},
  {'recall_at_25': 0.1838776455026455},
  {'recall_at_25': 0.1591621720116618},
  {'recall_at_25': 0.14619331065759636},
  {'recall_at_25': 0.1802741874527589},
  {'recall_at_25': 0.16835479429867184},
  {'recall_at_25': 0.18434456322211423}],
 [defaultdict(list, {'recall_at_25': [0.19490476190476186]}),
  defaultdict(list, {'recall_at_25': [0.1587142857142857]}),
  defaultdict(list, {'recall_at_25': [0.2660476190476191]}),
  defaultdict(list, {'recall_at_25': [0.18538095238095237]}),
  defaultdict(list, {'recall_at_25': [0.1967142857142857]}),
  defaultdict(list, {'recall_at_25': [0.20485714285714285]}),
  defaultdict(list, {'recall_at_25': [0.2126666666666667]}),
  defaultdict(list, {'recall_at_25': [0.2353333333333333]}),
  defaultdict(list, {'recall_at_25': [0.244]}),
  defaultdict(list, {'recall_at_25': [0.1765238095238
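The printed return value appears to be a pair of per-epoch metric lists; reading the second list as the validation metrics is an assumption. Under that assumption, the sketch below picks the epoch with the highest recall, using the nine complete values from the second list rounded to four decimals (the tenth value is cut off in the output and is left out).

```python
# Values copied from the second list in the output above, rounded to 4 decimals.
recall_by_epoch = [0.1949, 0.1587, 0.2660, 0.1854, 0.1967, 0.2049, 0.2127, 0.2353, 0.2440]

# Index of the largest value; epochs are numbered from 1.
best_index = max(range(len(recall_by_epoch)), key=recall_by_epoch.__getitem__)
best_epoch = best_index + 1
print(best_epoch, recall_by_epoch[best_index])  # 3 0.266
```

With only one validation batch per epoch, these values are noisy, so the best epoch here should be read as indicative rather than conclusive.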

## Predict Novel Associations

In [7]:
cholecystitis_proteins = ['ENG', 'ALDOA', 'GDF2', 'GPI', 'HK1', 'SMAD4','ARSA', 
                          'ABCB4', 'PKLR', 'BPGM', 'TPI1', 'ACVRL1']

In [8]:
diseases = load_diseases("data/go_associations/go_function/associations.csv", exclude_splits=["none"])

In [9]:
import random

In [12]:
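keys = list(diseases.keys())
# Pick a random function to explore...
function = diseases[random.choice(keys)]
# ...or fix a specific one for a reproducible walkthrough (this overrides the random choice).
function = diseases['GO:0008235']
len(function.proteins)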

14

In [13]:
function.name, function.id

('metalloexopeptidase activity', 'GO:0008235')

In [14]:
predicted_proteins = milieu.discover(entrez_ids=function.proteins, top_k=5)
predicted_proteins = list(zip(*predicted_proteins))[0]
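The `zip(*...)` idiom in the cell above suggests that `discover` returns a sequence of pairs whose first element is the protein identifier; treating the second element as a score is an assumption. The sketch below shows how the idiom separates such pairs, using hypothetical identifiers and scores.

```python
# Hypothetical (identifier, score) pairs standing in for the output of `discover`.
toy_predictions = [(1001, 0.93), (1002, 0.88), (1003, 0.71)]

# zip(*pairs) transposes the pairs into one tuple of identifiers and one of scores.
ids, scores = zip(*toy_predictions)
print(ids)     # (1001, 1002, 1003)
print(scores)  # (0.93, 0.88, 0.71)

# A dict keeps each score attached to its identifier for later lookup.
id_to_score = dict(toy_predictions)
print(id_to_score[1002])  # 0.88
```

Note that the transposed result is a tuple rather than a list, and that `zip(*pairs)` yields nothing to unpack when the input is empty.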

In [15]:
cy_vis = show_network(network, function.proteins, predicted_proteins, id_format="entrez",
                      model=milieu,
                      show_seed_mi=True, excluded_interactions=[("mutual_interactor", "mutual_interactor")],
                      save_path=f"experiments/network_vis/function/{function.id}_cy.json")

In [16]:
cy_vis

Cytoscape(data={'elements': {'nodes': [{'data': {'role': 'seed', 'id': '3014', 'entrez': '4323', 'genbank': 'M…

1. Menche, J. et al. Uncovering disease-disease relationships through the incomplete interactome. Science 347, 1257601–1257601 (2015).