# Data analysis with TogoDX: Network visualization

User story:
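- A researcher runs an experiment that quantifies some kind of molecule, for example a gene expression analysis, and obtains a list of IDs for the molecules of interest.
- The researcher wants to use annotations from integrated databases to examine how the molecules in the ID list relate to one another.
- TogoDX can filter and/or annotate the IDs (map your IDs).
    - When there are many IDs, they can be narrowed down by specific conditions (e.g., for genes: chromosome, coding gene, etc.).
    - Annotations can be added to the IDs by selecting attributes of interest (e.g., evolutionary conservation, pathways, related diseases, and drug information).
- The information obtained from TogoDX (view result) comes with caveats:
    - Columns from the second onward are guaranteed to be linked to the primary key, but it is unknown whether those columns are directly linked to each other.
    - Relationships between rows can be inferred from matching IDs in the columns, but they are hard for a human to interpret in table form.
- Run network visualization in a notebook on the data that TogoDX has filtered and/or annotated.
    - Use information from TogoID, which connects IDs to IDs and also serves as a backend of TogoDX.
    - This shows by which paths the columns within the same row are connected.
    - It also shows by which routes rows are connected to each other, or whether they are independent.
    - As a result, more important molecules, or groups of molecules, can be identified from what was a plain list of molecules.
        - This enables follow-up actions such as designing experiments that target hub molecules or isolated molecules.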
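
As a rough illustration of the last point, hub and isolated molecules can be read off an edge list by counting how many edges touch each node. The sketch below uses only the standard library; the node names, the edge list, and the `node_degrees` helper are hypothetical and are not part of TogoDX or TogoID.

```python
from collections import Counter

# Hypothetical edges between prefixed IDs (dataset:entry); not real TogoID output.
edges = [
    ("gene:A", "protein:P1"),
    ("gene:B", "protein:P1"),
    ("gene:C", "protein:P1"),
    ("gene:C", "pathway:W1"),
]
# Every node of interest, including one that has no edges at all.
nodes = {"gene:A", "gene:B", "gene:C", "gene:D", "protein:P1", "pathway:W1"}


def node_degrees(nodes, edges):
    """Count the edges touching each node; unconnected nodes keep degree 0."""
    degree = Counter({n: 0 for n in nodes})
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


degree = node_degrees(nodes, edges)
hub, hub_degree = degree.most_common(1)[0]
isolated = sorted(n for n, k in degree.items() if k == 0)
print(hub, hub_degree)
print(isolated)
```

Degree is only the simplest measure of importance; other centrality measures may suit a given experiment better.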
        
Reference:
- Enrichr-KG https://maayanlab.cloud/enrichr-kg
    - Similar in purpose, but the datasets it provides and the number of datasets that can be used for visualization are limited.

## Import packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from functools import reduce
from scipy import stats

In [2]:
pd.set_option('display.max_rows', 100)

In [3]:
import sys
from urllib.request import urlopen
import json
import time

In [4]:
!{sys.executable} -m pip install pyyaml --quiet
import yaml

In [5]:
!{sys.executable} -m pip install pyvis --quiet
# https://pyvis.readthedocs.io/
from pyvis.network import Network

## Load dataset

In [6]:
data_path = "../data/togodx-20230217-12655.tsv"
d = pd.read_table(data_path)
d

Unnamed: 0,orig_dataset,orig_entry,orig_label,dest_dataset,dest_entry,node,value
0,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000379939,transcript_biotype_ensembl,protein coding
1,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000400445,transcript_biotype_ensembl,protein coding
2,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000537702,transcript_biotype_ensembl,protein coding
3,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000629018,transcript_biotype_ensembl,protein coding
4,ensembl_gene,ENSG00000172915,NBEA,ncbigene,26960,gene_evolutionary_conservation_homologene,"Insect, Worm"
...,...,...,...,...,...,...,...
4529,ensembl_gene,ENSG00000157933,SKI,uniprot,P12755,protein_disease_related_proteins_uniprot,Disease variant
4530,ensembl_gene,ENSG00000157933,SKI,uniprot,P12755,structure_data_existence_uniprot,Proteins with structure data
4531,ensembl_gene,ENSG00000157933,SKI,uniprot,P12755,interaction_proteins_in_pathway_reactome,Signaling Pathways
4532,ensembl_gene,ENSG00000157933,SKI,uniprot,P12755,interaction_proteins_in_pathway_reactome,Gene expression (Transcription)
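
The `dest_entry` column mixes purely numeric entries (such as the `ncbigene` entry `26960`) with alphanumeric ones. In the full table this keeps the column as text, but a file that happens to contain only numeric entries would be parsed as integers, and later string concatenation would fail. A minimal sketch of one way to guard against that, assuming it is acceptable to read every column as text with `dtype=str`; the two-line sample TSV is made up for illustration:

```python
import io

import pandas as pd

# Made-up sample in which every dest_entry value is numeric.
sample_tsv = (
    "orig_dataset\torig_entry\tdest_dataset\tdest_entry\n"
    "ensembl_gene\tENSG00000172915\tncbigene\t26960\n"
)

# Default type inference turns the all-numeric column into integers.
inferred = pd.read_table(io.StringIO(sample_tsv))
print(inferred["dest_entry"].dtype)

# Reading everything as text keeps IDs usable for string concatenation.
as_text = pd.read_table(io.StringIO(sample_tsv), dtype=str)
dest_id = as_text["dest_dataset"] + ":" + as_text["dest_entry"]
print(dest_id.tolist())
```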


In [7]:
d.describe()

Unnamed: 0,orig_dataset,orig_entry,orig_label,dest_dataset,dest_entry,node,value
count,4534,4534,4534,4534,4534,4534,4534
unique,1,154,154,4,1894,8,99
top,ensembl_gene,ENSG00000044115,CTNNA1,uniprot,836,interaction_proteins_in_pathway_reactome,protein coding
freq,4534,126,126,2528,19,903,885


In [8]:
d['node'].unique()

array(['transcript_biotype_ensembl',
       'gene_evolutionary_conservation_homologene',
       'gene_molecular_function_ncbigene',
       'gene_biological_process_ncbigene',
       'protein_disease_related_proteins_uniprot',
       'structure_data_existence_uniprot',
       'interaction_proteins_in_pathway_reactome',
       'disease_diseases_mondo'], dtype=object)

## Extract subset

In [9]:
# Take the first row by position rather than by index label
first_row_orig_id = d['orig_entry'].iloc[0]
first_row_orig_id

'ENSG00000172915'

In [10]:
orig_id = first_row_orig_id
subset = d[d['orig_entry'] == orig_id]
subset

Unnamed: 0,orig_dataset,orig_entry,orig_label,dest_dataset,dest_entry,node,value
0,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000379939,transcript_biotype_ensembl,protein coding
1,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000400445,transcript_biotype_ensembl,protein coding
2,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000537702,transcript_biotype_ensembl,protein coding
3,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000629018,transcript_biotype_ensembl,protein coding
4,ensembl_gene,ENSG00000172915,NBEA,ncbigene,26960,gene_evolutionary_conservation_homologene,"Insect, Worm"
5,ensembl_gene,ENSG00000172915,NBEA,ncbigene,26960,gene_molecular_function_ncbigene,binding
6,ensembl_gene,ENSG00000172915,NBEA,ncbigene,26960,gene_biological_process_ncbigene,cellular process
7,ensembl_gene,ENSG00000172915,NBEA,ncbigene,26960,gene_biological_process_ncbigene,localization
8,ensembl_gene,ENSG00000172915,NBEA,uniprot,A0A0D9SF28,protein_disease_related_proteins_uniprot,Unclassified
9,ensembl_gene,ENSG00000172915,NBEA,uniprot,A0A8I5KQL6,protein_disease_related_proteins_uniprot,Unclassified


In [11]:
# Build prefixed IDs (dataset:entry). assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered subset.
subset = subset.assign(
    orig_id=subset['orig_dataset'] + ':' + subset['orig_entry'],
    dest_id=subset['dest_dataset'] + ':' + subset['dest_entry'],
)
subset


Unnamed: 0,orig_dataset,orig_entry,orig_label,dest_dataset,dest_entry,node,value,orig_id,dest_id
0,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000379939,transcript_biotype_ensembl,protein coding,ensembl_gene:ENSG00000172915,ensembl_transcript:ENST00000379939
1,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000400445,transcript_biotype_ensembl,protein coding,ensembl_gene:ENSG00000172915,ensembl_transcript:ENST00000400445
2,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000537702,transcript_biotype_ensembl,protein coding,ensembl_gene:ENSG00000172915,ensembl_transcript:ENST00000537702
3,ensembl_gene,ENSG00000172915,NBEA,ensembl_transcript,ENST00000629018,transcript_biotype_ensembl,protein coding,ensembl_gene:ENSG00000172915,ensembl_transcript:ENST00000629018
4,ensembl_gene,ENSG00000172915,NBEA,ncbigene,26960,gene_evolutionary_conservation_homologene,"Insect, Worm",ensembl_gene:ENSG00000172915,ncbigene:26960
5,ensembl_gene,ENSG00000172915,NBEA,ncbigene,26960,gene_molecular_function_ncbigene,binding,ensembl_gene:ENSG00000172915,ncbigene:26960
6,ensembl_gene,ENSG00000172915,NBEA,ncbigene,26960,gene_biological_process_ncbigene,cellular process,ensembl_gene:ENSG00000172915,ncbigene:26960
7,ensembl_gene,ENSG00000172915,NBEA,ncbigene,26960,gene_biological_process_ncbigene,localization,ensembl_gene:ENSG00000172915,ncbigene:26960
8,ensembl_gene,ENSG00000172915,NBEA,uniprot,A0A0D9SF28,protein_disease_related_proteins_uniprot,Unclassified,ensembl_gene:ENSG00000172915,uniprot:A0A0D9SF28
9,ensembl_gene,ENSG00000172915,NBEA,uniprot,A0A8I5KQL6,protein_disease_related_proteins_uniprot,Unclassified,ensembl_gene:ENSG00000172915,uniprot:A0A8I5KQL6


In [12]:
id_pairs = subset[['orig_id', 'dest_id']].drop_duplicates()
id_pairs = id_pairs[id_pairs['orig_id'] != id_pairs['dest_id']]
id_pairs

Unnamed: 0,orig_id,dest_id
0,ensembl_gene:ENSG00000172915,ensembl_transcript:ENST00000379939
1,ensembl_gene:ENSG00000172915,ensembl_transcript:ENST00000400445
2,ensembl_gene:ENSG00000172915,ensembl_transcript:ENST00000537702
3,ensembl_gene:ENSG00000172915,ensembl_transcript:ENST00000629018
4,ensembl_gene:ENSG00000172915,ncbigene:26960
8,ensembl_gene:ENSG00000172915,uniprot:A0A0D9SF28
9,ensembl_gene:ENSG00000172915,uniprot:A0A8I5KQL6
10,ensembl_gene:ENSG00000172915,uniprot:A0A8I5KQP5
11,ensembl_gene:ENSG00000172915,uniprot:A0A8I5KRX1
12,ensembl_gene:ENSG00000172915,uniprot:A0A8I5KRZ1


## Visualization

### Configuration

In [13]:
categories_color = {
  "Analysis": { "color": "#696969" },
  "Compound": { "color": "#a853c6" },
  "Disease": { "color": "#5361c6" },
  "Domain": { "color": "#a2c653" },
  "Experiment": { "color": "#696969" },
  "Function": { "color": "#696969" },
  "Gene": { "color": "#53c666" },
  "Glycan": { "color": "#673aa6" },
  "Interaction": { "color": "#c65381" },
  "Literature": { "color": "#696969" },
  "Ortholog": { "color": "#53c666" },
  "Pathway": { "color": "#c65381" },
  "Probe": { "color": "#53c666" },
  "Project": { "color": "#696969" },
  "Protein": { "color": "#a2c653" },
  "Reaction": { "color": "#c65381" },
  "Sample": { "color": "#696969" },
  "SequenceRun": { "color": "#696969" },
  "Structure": { "color": "#c68753" },
  "Submission": { "color": "#696969" },
  "Taxonomy": { "color": "#006400" },
  "Transcript": { "color": "#53c666" },
  "Variant": { "color": "#53c3c6" },
}

In [26]:
attributes_json_url = 'https://github.com/togodx/togodx-config-human/raw/develop/config/attributes.dx-server.json'
attributes_json = json.loads(urlopen(attributes_json_url).read())
# attributes_json

In [27]:
togoid_dataset_config_url = 'https://github.com/togoid/togoid-config/raw/main/config/dataset.yaml'
dataset_config = yaml.safe_load(urlopen(togoid_dataset_config_url).read())
# dataset_config

In [16]:
dataset_color = {}
for ds in dataset_config:
    cat = dataset_config[ds]['category']
    if cat not in categories_color:
        continue
    dataset_color[ds] = categories_color[cat]['color']

dataset_color

{'affy_probeset': '#53c666',
 'bioproject': '#696969',
 'biosample': '#696969',
 'ccds': '#53c666',
 'chebi': '#a853c6',
 'chembl_compound': '#a853c6',
 'chembl_target': '#a2c653',
 'civic_gene': '#53c666',
 'clinvar': '#53c3c6',
 'dbsnp': '#53c3c6',
 'dgidb': '#c65381',
 'doid': '#5361c6',
 'drugbank': '#a853c6',
 'ec': '#696969',
 'ena': '#53c666',
 'ensembl_gene': '#53c666',
 'ensembl_protein': '#a2c653',
 'ensembl_transcript': '#53c666',
 'flybase_gene': '#53c666',
 'gea': '#696969',
 'glytoucan': '#673aa6',
 'go': '#696969',
 'hgnc': '#53c666',
 'hgnc_symbol': '#53c666',
 'hmdb': '#a853c6',
 'homologene': '#53c666',
 'cog': '#53c666',
 'hp': '#5361c6',
 'human_protein_atlas': '#a2c653',
 'inchi_key': '#a853c6',
 'insdc': '#53c666',
 'insdc_master': '#696969',
 'intact': '#c65381',
 'interpro': '#a2c653',
 'kegg_compound': '#a853c6',
 'kegg_disease': '#5361c6',
 'kegg_orthology': '#53c666',
 'kegg_pathway': '#c65381',
 'kegg_reaction': '#c65381',
 'lrg': '#53c666',
 'mbgd_gene': '#
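
The loop above silently skips datasets whose category has no entry in `categories_color`, so a later lookup for such a dataset would raise a `KeyError`. One possible alternative is to resolve colors through a helper with a fallback. The sketch below is self-contained and uses a small hypothetical excerpt shaped like the notebook's `dataset_config`; `example_dataset`, `color_for`, and the choice of fallback color are assumptions.

```python
# Hypothetical excerpt in the shape the notebook reads from dataset.yaml:
# each dataset name maps to a dict with a 'category' key.
dataset_config = {
    "ensembl_gene": {"category": "Gene"},
    "ensembl_protein": {"category": "Protein"},
    "example_dataset": {"category": "SomethingNew"},  # category without a color
}
categories_color = {
    "Gene": {"color": "#53c666"},
    "Protein": {"color": "#a2c653"},
}
FALLBACK_COLOR = "#696969"  # assumption: reuse the gray already used for several categories


def color_for(dataset):
    """Return the category color for a dataset, or the fallback if unknown."""
    category = dataset_config.get(dataset, {}).get("category")
    return categories_color.get(category, {}).get("color", FALLBACK_COLOR)


print(color_for("ensembl_gene"))
print(color_for("example_dataset"))
print(color_for("not_in_config"))
```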

### PyVis init

In [17]:
net = Network(notebook=True, cdn_resources='in_line', bgcolor='gray', font_color='white')

### Add hub node (primary dataset) and connected nodes (attributes added by TogoDX)

In [18]:
for oid in id_pairs['orig_id'].drop_duplicates():
    net.add_node(oid, color='white')

In [19]:
for did in id_pairs['dest_id'].drop_duplicates():
    ds = did.split(':')[0]
    # Fall back to gray for datasets without a configured color
    color = dataset_color.get(ds, '#696969')
    net.add_node(did, color=color)
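
The cell above recovers the dataset name by splitting the prefixed ID at `:`. If an entry could itself contain a colon, splitting at the first colon only keeps the entry intact. A small sketch of that idea; `split_prefixed_id` is a hypothetical helper, and `example:AB:12` is a made-up ID used only to show the behaviour:

```python
def split_prefixed_id(prefixed_id):
    """Split 'dataset:entry' at the first colon only."""
    dataset, sep, entry = prefixed_id.partition(":")
    if not sep or not dataset or not entry:
        raise ValueError(f"not a dataset:entry ID: {prefixed_id!r}")
    return dataset, entry


print(split_prefixed_id("uniprot:A0A0D9SF28"))
print(split_prefixed_id("example:AB:12"))
```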

In [20]:
net.add_edges(id_pairs.to_numpy())

In [21]:
net.toggle_physics(True)
net.show('mygraph.html')

In [22]:
# Entries from datasets other than the primary (hub) dataset
primary_dataset = subset['orig_dataset'].iloc[0]
outer_entries = subset[subset['dest_dataset'] != primary_dataset][['dest_dataset', 'dest_entry']].drop_duplicates()
outer_entries

Unnamed: 0,dest_dataset,dest_entry
0,ensembl_transcript,ENST00000379939
1,ensembl_transcript,ENST00000400445
2,ensembl_transcript,ENST00000537702
3,ensembl_transcript,ENST00000629018
4,ncbigene,26960
8,uniprot,A0A0D9SF28
9,uniprot,A0A8I5KQL6
10,uniprot,A0A8I5KQP5
11,uniprot,A0A8I5KRX1
12,uniprot,A0A8I5KRZ1


In [23]:
datasets_dict = attributes_json['datasets']
datasets_dict

{'ensembl_gene': {'label': 'Ensembl gene',
  'template': 'https://raw.githubusercontent.com/togodx/togodx-config-human/develop/templates/ensembl_gene.hbs',
  'target': True,
  'examples': ['ENSG00000148584',
   'ENSG00000164398',
   'ENSG00000127914',
   'ENSG00000117020',
   'ENSG00000111275',
   'ENSG00000029534',
   'ENSG00000078061',
   'ENSG00000100852',
   'ENSG00000104728',
   'ENSG00000074964',
   'ENSG00000143970',
   'ENSG00000198604',
   'ENSG00000126453',
   'ENSG00000029363',
   'ENSG00000115760',
   'ENSG00000112175',
   'ENSG00000159388',
   'ENSG00000261652',
   'ENSG00000164305',
   'ENSG00000132906',
   'ENSG00000112237',
   'ENSG00000183813',
   'ENSG00000126353',
   'ENSG00000090659',
   'ENSG00000178562',
   'ENSG00000040731',
   'ENSG00000079112',
   'ENSG00000124762',
   'ENSG00000121289',
   'ENSG00000173575',
   'ENSG00000109220',
   'ENSG00000171310',
   'ENSG00000172409',
   'ENSG00000176571',
   'ENSG00000174469',
   'ENSG00000168542',
   'ENSG00000164919',


In [24]:
# Connect outer nodes (entries from datasets other than the primary dataset, i.e. the network hub node)
targets = outer_entries['dest_dataset'].unique()
for _, row in outer_entries.iterrows():
    for target in targets:
        # Only convert to datasets other than the entry's own
        if target == row['dest_dataset']:
            continue
        print('from: ' + row['dest_dataset'] + ' to: ' + target)

        # Skip pairs for which the TogoDX config defines no TogoID conversion route
        config_dataset = datasets_dict[row['dest_dataset']]['conversion']
        if target not in config_dataset:
            continue

        togoid_api_route = config_dataset[target]
        api_url = togoid_api_route + row['dest_entry']
        print(api_url)

        # Throttle requests to the TogoID API (only when a request is actually sent)
        time.sleep(1)
        with urlopen(api_url) as res_json:
            res = json.loads(res_json.read())

        n_from = row['dest_dataset'] + ':' + row['dest_entry']
        for i in res['results']:
            n_to = target + ':' + i
            print('Edge from:' + n_from + ', to: ' + n_to)
            net.add_node(n_to)
            net.add_edge(n_from, n_to)

from: ensembl_transcript to: ncbigene
https://api.togoid.dbcls.jp/convert?format=json&route=ensembl_transcript,ncbigene&ids=ENST00000379939
Edge from:ensembl_transcript:ENST00000379939, to: ncbigene:26960
from: ensembl_transcript to: uniprot
https://api.togoid.dbcls.jp/convert?format=json&route=ensembl_transcript,uniprot&ids=ENST00000379939
Edge from:ensembl_transcript:ENST00000379939, to: uniprot:Q5T321
from: ensembl_transcript to: mondo
https://api.togoid.dbcls.jp/convert?format=json&route=ensembl_transcript,ncbigene,medgen,mondo&ids=ENST00000379939
Edge from:ensembl_transcript:ENST00000379939, to: mondo:0030930
from: ensembl_transcript to: ncbigene
https://api.togoid.dbcls.jp/convert?format=json&route=ensembl_transcript,ncbigene&ids=ENST00000400445
Edge from:ensembl_transcript:ENST00000400445, to: ncbigene:26960
from: ensembl_transcript to: uniprot
https://api.togoid.dbcls.jp/convert?format=json&route=ensembl_transcript,uniprot&ids=ENST00000400445
Edge from:ensembl_transcript:ENST00

from: uniprot to: ensembl_transcript
from: uniprot to: ncbigene
https://api.togoid.dbcls.jp/convert?format=json&route=uniprot,ncbigene&ids=A0A8I5QKR6
from: uniprot to: mondo
https://api.togoid.dbcls.jp/convert?format=json&route=uniprot,ncbigene,medgen,mondo&ids=A0A8I5QKR6
from: uniprot to: ensembl_transcript
from: uniprot to: ncbigene
https://api.togoid.dbcls.jp/convert?format=json&route=uniprot,ncbigene&ids=Q5T321
Edge from:uniprot:Q5T321, to: ncbigene:26960
from: uniprot to: mondo
https://api.togoid.dbcls.jp/convert?format=json&route=uniprot,ncbigene,medgen,mondo&ids=Q5T321
Edge from:uniprot:Q5T321, to: mondo:0030930
from: uniprot to: ensembl_transcript
from: uniprot to: ncbigene
https://api.togoid.dbcls.jp/convert?format=json&route=uniprot,ncbigene&ids=Q8NFP9
Edge from:uniprot:Q8NFP9, to: ncbigene:26960
from: uniprot to: mondo
https://api.togoid.dbcls.jp/convert?format=json&route=uniprot,ncbigene,medgen,mondo&ids=Q8NFP9
Edge from:uniprot:Q8NFP9, to: mondo:0030930
from: mondo to: ens
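
The URLs in the log follow a regular pattern: a `route` of comma-separated dataset names and an `ids` parameter. The sketch below shows one way to build such a URL from a list of dataset names and to turn a response into edge tuples, without making a network call. The endpoint and parameter names are copied from the log; `build_convert_url`, `edges_from_response`, and the sample response body are hypothetical, and the body is reduced to the single `results` field that the notebook reads.

```python
import json
from urllib.parse import urlencode

TOGOID_CONVERT = "https://api.togoid.dbcls.jp/convert"  # endpoint as printed in the log


def build_convert_url(route, entry):
    """Build a convert URL for a route given as a list of dataset names."""
    query = urlencode(
        {"format": "json", "route": ",".join(route), "ids": entry},
        safe=",",  # keep the commas in the route unescaped, as in the log
    )
    return f"{TOGOID_CONVERT}?{query}"


def edges_from_response(payload, source_id, target):
    """Turn a response body with a 'results' list of IDs into edge tuples."""
    results = json.loads(payload).get("results", [])
    return [(source_id, f"{target}:{entry}") for entry in results]


url = build_convert_url(["ensembl_transcript", "ncbigene"], "ENST00000379939")
print(url)

# Hypothetical response body for illustration only.
sample_payload = '{"results": ["26960"]}'
new_edges = edges_from_response(
    sample_payload, "ensembl_transcript:ENST00000379939", "ncbigene"
)
print(new_edges)
```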

In [25]:
net.toggle_physics(True)
net.show('mygraph.html')
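
Beyond the rendered graph, the same edges can answer the question from the user story of how two entries are connected. The sketch below runs a breadth-first search over a few edges copied from the log output above and returns the shortest path between two nodes, or `None` if they are not connected. It uses only the standard library; `shortest_path` is a hypothetical helper, and `example:isolated` is a made-up node that appears in no edge.

```python
from collections import defaultdict, deque

# A few edges copied from the log output of the conversion loop.
edges = [
    ("ensembl_transcript:ENST00000379939", "ncbigene:26960"),
    ("ensembl_transcript:ENST00000379939", "uniprot:Q5T321"),
    ("uniprot:Q5T321", "mondo:0030930"),
    ("uniprot:Q8NFP9", "ncbigene:26960"),
]

# Treat the graph as undirected.
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)


def shortest_path(start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


path = shortest_path("uniprot:Q8NFP9", "mondo:0030930")
print(" -> ".join(path))
print(shortest_path("uniprot:Q8NFP9", "example:isolated"))
```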