# Milieu

Milieu is a disease protein discovery algorithm based on the hypothesis that proteins associated with the same disease share mutual interactors in the protein-protein interaction network.   
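
To make the hypothesis concrete, here is a minimal sketch that finds the mutual interactors of two proteins on a toy network. The adjacency map, the protein names (`A` to `E`), and the helper `mutual_interactors` are hypothetical illustrations and are not part of the `milieu` package. The sketch only counts shared neighbors and does not attempt to reproduce what the trained model computes.

```python
# Toy undirected PPI network stored as an adjacency map.
# Protein names are hypothetical placeholders.
toy_network = {
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"B"},
}


def mutual_interactors(network, protein_u, protein_v):
    """Return the proteins that interact with both protein_u and protein_v."""
    return network[protein_u] & network[protein_v]


# A and B do not interact directly, but they share two interactors.
shared = mutual_interactors(toy_network, "A", "B")
print(sorted(shared))
```

Under the hypothesis, two proteins with many shared interactors (like `A` and `B` here) are more likely to be associated with the same disease than two proteins with none.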

In [1]:
%load_ext autoreload
%autoreload 2

import os


from milieu.data.network import PPINetwork
from milieu.data.associations import load_diseases
from milieu.milieu import MilieuDataset, Milieu

os.chdir("/dfs/scratch0/sabri/milieu")
#os.chdir("/Users/sabrieyuboglu/Documents/sabri/research/milieu")



## Load the PPI Network

We use the protein-protein interaction network compiled by Menche *et al.* [1]. The network consists of 342,353 interactions between 21,557 proteins.
You can find this network in `data/networks` as `bio-pathways-network.txt`. See Methods for a more detailed description of the network.
Two other protein-protein interaction networks, `string-network.txt` and `bio-grid-network.txt`, are also available there. See Supplementary Note 3 for a detailed description.

In [2]:
network = PPINetwork("data/networks/bio-pathways-network.txt")
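
As a sanity check on the protein and interaction counts quoted above, the sketch below tallies unique nodes and undirected edges from an edge list. It assumes a plain two-column, whitespace-separated file with one interaction per line, which this notebook does not confirm for `bio-pathways-network.txt`. The helper `summarize_edge_list` and the toy data are hypothetical and independent of `PPINetwork`.

```python
import io


def summarize_edge_list(handle):
    """Count unique proteins and unique undirected interactions.

    Assumes each non-empty line holds two whitespace-separated protein IDs.
    """
    proteins = set()
    interactions = set()
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        u, v = fields[0], fields[1]
        proteins.update((u, v))
        # frozenset makes (u, v) and (v, u) count as the same interaction.
        interactions.add(frozenset((u, v)))
    return len(proteins), len(interactions)


# Toy edge list; "2 1" duplicates "1 2" and is counted once.
toy_file = io.StringIO("1 2\n2 3\n3 1\n2 1\n")
num_proteins, num_interactions = summarize_edge_list(toy_file)
print(num_proteins, num_interactions)
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` object.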

## Build the *Milieu* Model

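We configure the model with the `params` dictionary below, which sets the device, batching, optimizer, and evaluation metric.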

In [3]:
params = {
    "cuda": True,
    "device": 2,
    
    "batch_size": 200,
    "num_workers": 4,
    "num_epochs": 10,
    
    "optim_class": "Adam",
    "optim_args": {
        "lr": 1,
        "weight_decay": 0
    },
    
    "metric_configs": [
        {
            "name": "recall_at_25",
            "fn": "batch_recall_at", 
            "args": {"k":25}
        }
    ]
}
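
The `recall_at_25` entry in `metric_configs` names a recall-at-k metric. The sketch below shows one common definition: the fraction of held-out disease proteins that appear among the top `k` ranked predictions. The function `recall_at_k` and the toy protein IDs are hypothetical, and the actual `batch_recall_at` implementation in `milieu` may differ, for example in how it handles batches or ties.

```python
def recall_at_k(ranked_proteins, held_out_proteins, k):
    """Fraction of held-out proteins found in the top-k of a ranking."""
    if not held_out_proteins:
        return 0.0  # avoid dividing by zero when nothing is held out
    top_k = set(ranked_proteins[:k])
    hits = len(top_k & set(held_out_proteins))
    return hits / len(held_out_proteins)


# Toy ranking (best first) and a toy set of held-out true associations.
ranking = ["P1", "P2", "P3", "P4", "P5"]
held_out = {"P2", "P5", "P9"}

# Only P2 is in the top 3, so one of the three held-out proteins is recovered.
score = recall_at_k(ranking, held_out, k=3)
print(score)
```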

In [4]:
milieu = Milieu(network, params)

Milieu
Setting parameters...
Building model...
Building optimizer...
Done.


## Train the Model
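*Milieu* is trained on a large set of known disease-protein associations. We use the associations in `data/associations/disgenet-associations.csv`, training on the first 90% of diseases and validating on the remaining 10%.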

In [5]:
diseases = list(load_diseases("data/associations/disgenet-associations.csv", exclude_splits=["none"]).values())
train_diseases = diseases[:int(len(diseases)* 0.9)]
valid_diseases = diseases[int(len(diseases)* 0.9):]
train_dataset = MilieuDataset(network, diseases=train_diseases)
valid_dataset = MilieuDataset(network, diseases=valid_diseases)
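
The slicing used for the split can be checked on a toy list. The sketch below uses a hypothetical helper, `split_by_fraction`, that applies the same `int(len(...) * 0.9)` cutoff as the cell above. The split is positional, so it depends on the order in which the diseases were loaded.

```python
def split_by_fraction(items, train_fraction=0.9):
    """Split a list into a leading training part and a trailing validation part."""
    cutoff = int(len(items) * train_fraction)
    return items[:cutoff], items[cutoff:]


# Toy stand-ins for the loaded disease objects.
toy_diseases = [f"disease_{i}" for i in range(20)]
train, valid = split_by_fraction(toy_diseases)

# Every item lands in exactly one split, in the original order.
print(len(train), len(valid))
```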

In [16]:
train_metrics, valid_metrics = milieu.train_model(train_dataset, valid_dataset)

Starting training for 10 epoch(s)
Epoch 1 of 10
Training


100%|██████████| 9/9 [00:13<00:00,  1.17s/it, loss=0.911]

Validation



100%|██████████| 1/1 [00:02<00:00,  2.29s/it]

Epoch 2 of 10
Training



100%|██████████| 9/9 [00:12<00:00,  1.13s/it, loss=0.910]

Validation



100%|██████████| 1/1 [00:02<00:00,  2.08s/it]

Epoch 3 of 10
Training



100%|██████████| 9/9 [00:12<00:00,  1.13s/it, loss=0.910]

Validation



100%|██████████| 1/1 [00:02<00:00,  2.16s/it]

Epoch 4 of 10
Training



100%|██████████| 9/9 [00:12<00:00,  1.15s/it, loss=0.916]

Validation



100%|██████████| 1/1 [00:02<00:00,  2.23s/it]

Epoch 5 of 10
Training



100%|██████████| 9/9 [00:12<00:00,  1.13s/it, loss=0.909]

Validation



100%|██████████| 1/1 [00:02<00:00,  2.14s/it]

Epoch 6 of 10
Training



100%|██████████| 9/9 [00:12<00:00,  1.13s/it, loss=0.902]

Validation



100%|██████████| 1/1 [00:02<00:00,  2.09s/it]

Epoch 7 of 10
Training



100%|██████████| 9/9 [00:12<00:00,  1.12s/it, loss=0.902]

Validation



100%|██████████| 1/1 [00:02<00:00,  2.21s/it]

Epoch 8 of 10
Training



100%|██████████| 9/9 [00:12<00:00,  1.12s/it, loss=0.896]

Validation



100%|██████████| 1/1 [00:02<00:00,  2.10s/it]

Epoch 9 of 10
Training



100%|██████████| 9/9 [00:13<00:00,  1.20s/it, loss=0.909]


Validation


100%|██████████| 1/1 [00:02<00:00,  2.16s/it]

Epoch 10 of 10
Training



100%|██████████| 9/9 [00:13<00:00,  1.20s/it, loss=0.893]

Validation



100%|██████████| 1/1 [00:02<00:00,  2.26s/it]


## Predict Novel Associations

In [8]:
cholecystitis_proteins = ['ENG', 'ALDOA', 'GDF2', 'GPI', 'HK1', 'SMAD4','ARSA', 
                          'ABCB4', 'PKLR', 'BPGM', 'TPI1', 'ACVRL1']

In [11]:
milieu.discover(genbank_ids=cholecystitis_proteins, top_k=25)

[('TGFBR3', 0.99688125),
 ('HKDC1', 0.99394965),
 ('INHA', 0.99321705),
 ('TGFB2', 0.99226123),
 ('INHBC', 0.9908999),
 ('BMP2', 0.9901833),
 ('BMP6', 0.9877698),
 ('GCK', 0.98390573),
 ('ACVR2A', 0.9797916),
 ('ACVR2B', 0.9777183),
 ('PGAM2', 0.9767097),
 ('IGSF1', 0.9760065),
 ('GDF5', 0.96969706),
 ('PGAM1', 0.9685065),
 ('RGMB', 0.9672417),
 ('MGP', 0.9670852),
 ('HK3', 0.9670852),
 ('INHBA', 0.9664927),
 ('PGK2', 0.9661868),
 ('GDF7', 0.9655239),
 ('BMP10', 0.95562977),
 ('INHBB', 0.94661707),
 ('CHRDL2', 0.9445417),
 ('FMOD', 0.94226974)]
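
The output above is a list of `(protein, score)` pairs sorted by descending score, so it can be post-processed with ordinary list operations. As a small sketch, the snippet below keeps only predictions at or above a score cutoff. The first eight pairs are copied from the output above. The helper `filter_by_score` and the cutoff of 0.99 are hypothetical choices for illustration, not recommended thresholds.

```python
# First eight predictions copied from the output above.
predictions = [
    ("TGFBR3", 0.99688125),
    ("HKDC1", 0.99394965),
    ("INHA", 0.99321705),
    ("TGFB2", 0.99226123),
    ("INHBC", 0.9908999),
    ("BMP2", 0.9901833),
    ("BMP6", 0.9877698),
    ("GCK", 0.98390573),
]


def filter_by_score(pairs, min_score):
    """Keep the (protein, score) pairs whose score is at least min_score."""
    return [(protein, score) for protein, score in pairs if score >= min_score]


confident = filter_by_score(predictions, min_score=0.99)
print([protein for protein, _ in confident])
```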

In [14]:
arnold_proteins = ["FGFR1", "ERF", "MKS1", "POR", "FGFR3", "FGFR2", "NOTCH2", "PTCH1", "ZIC1"]

In [15]:
milieu.discover(genbank_ids=arnold_proteins, top_k=25)

[('PHEX', 0.9999964),
 ('COL3A1', 0.99689615),
 ('ZIC3', 0.9864237),
 ('FGFRL1', 0.9861942),
 ('COL6A1', 0.97802424),
 ('NRP1', 0.9757894),
 ('DUSP3', 0.9686838),
 ('FGFBP1', 0.96764284),
 ('PTCH2', 0.9665226),
 ('ZIC2', 0.96509117),
 ('HSPG2', 0.9603956),
 ('DESI1', 0.950775),
 ('FGF9', 0.94697315),
 ('DLL3', 0.9399293),
 ('CSHL1', 0.9371621),
 ('COL2A1', 0.9358387),
 ('COL1A1', 0.9207403),
 ('STK36', 0.8977488),
 ('PAX2', 0.87458116),
 ('CYB5A', 0.87166786),
 ('FGFR4', 0.85650355),
 ('STIL', 0.85361814)]

1. Menche, J. et al. Uncovering disease-disease relationships through the incomplete interactome. Science 347, 1257601 (2015).