# Serializing GMT Files to CSV

In this notebook we'll convert GMT files from [Enrichr](https://maayanlab.cloud/Enrichr) to CSV files that can be ingested into the graph database. This involves several steps:
1. Mapping the terms to IDs, or generating IDs for them
2. Mapping genes to their Entrez ID
3. Creating the CSV file

## CSV Serialization
Nodes and edges are serialized differently for our knowledge graph. 

### Node Serialization
Serialized nodes require two columns: `id` (any unique identifier, such as a UUID, an ontology ID, or another persistent identifier) and `label`. Optionally, you can add more columns for additional metadata. The CSV file should be named `<node_type>.nodes.csv` to be compatible with the provided ingestion script. For our GMT files, this means we need two node files: (1) one for the term type (the gene set labels) and (2) one for the genes.

|      |   id          |   label                                                                    |   ontology_label                                              |   uri                                                  |
|------|---------------|----------------------------------------------------------------------------|---------------------------------------------------------------|--------------------------------------------------------|
|   0  |   GO:0051084  |   'de novo' posttranslational protein folding (GO:0051084)                 |   'de novo' posttranslational protein folding                 |   http://amigo.geneontology.org/amigo/term/GO:0051084  |
|   1  |   GO:0006103  |   2-oxoglutarate metabolic process (GO:0006103)                            |   2-oxoglutarate metabolic process                            |   http://amigo.geneontology.org/amigo/term/GO:0006103  |
|   2  |   GO:0050428  |   3'-phosphoadenosine 5'-phosphosulfate biosynthetic process (GO:0050428)  |   3'-phosphoadenosine 5'-phosphosulfate biosynthetic process  |   http://amigo.geneontology.org/amigo/term/GO:0050428  |
|   3  |   GO:0050427  |   3'-phosphoadenosine 5'-phosphosulfate metabolic process (GO:0050427)     |   3'-phosphoadenosine 5'-phosphosulfate metabolic process     |   http://amigo.geneontology.org/amigo/term/GO:0050427  |
|   4  |   GO:0061158  |   3'-UTR-mediated mRNA destabilization (GO:0061158)                        |   3'-UTR-mediated mRNA destabilization                        |   http://amigo.geneontology.org/amigo/term/GO:0061158  |
|   5  |   GO:0070935  |   3'-UTR-mediated mRNA stabilization (GO:0070935)                          |   3'-UTR-mediated mRNA stabilization                          |   http://amigo.geneontology.org/amigo/term/GO:0070935  |
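
As a minimal sketch of the node layout described above, the snippet below writes node records so that `id` and `label` come first and any extra metadata columns follow. The helper name `write_nodes_csv` is hypothetical and is not part of the ingestion script; the output is kept in memory here, but it could equally be saved as, for example, `GO Biological Process Term.nodes.csv`.

```python
import csv
import io

def write_nodes_csv(records, handle):
    """Write node records with `id` and `label` first, then extra metadata columns."""
    # Collect optional metadata columns that appear in any record
    extra = sorted({key for record in records for key in record} - {"id", "label"})
    writer = csv.DictWriter(handle, fieldnames=["id", "label", *extra], restval="")
    writer.writeheader()
    for record in records:
        # The two required columns must be present and non-empty
        if not record.get("id") or not record.get("label"):
            raise ValueError("every node needs an id and a label")
        writer.writerow(record)

nodes = [
    {"id": "GO:0051084", "label": "'de novo' posttranslational protein folding (GO:0051084)"},
    {"id": "GO:0006103", "label": "2-oxoglutarate metabolic process (GO:0006103)",
     "uri": "http://amigo.geneontology.org/amigo/term/GO:0006103"},
]
buffer = io.StringIO()
write_nodes_csv(nodes, buffer)
node_csv_text = buffer.getvalue()
print(node_csv_text)
```

Records that lack an optional column simply get an empty cell for it.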

### Edge Serialization

Edges, meanwhile, require three columns: (1) the source ID, (2) the relation, and (3) the target ID. The rest are optional metadata. The CSV file should be named as follows: `<source_node_type>.<relation>.<target_node_type>.edges.csv`.

|      |   source  |   relation  |   target      |   source_label  |   target_label                                              |   resource       |   link_to_resource          |
|------|-----------|-------------|---------------|-----------------|-------------------------------------------------------------|------------------|-----------------------------|
|   0  |   23753   |   GO BP     |   GO:0051084  |   SDF2L1        |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
|   1  |   3313    |   GO BP     |   GO:0051084  |   HSPA9         |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
|   2  |   10576   |   GO BP     |   GO:0051084  |   CCT2          |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
|   3  |   6767    |   GO BP     |   GO:0051084  |   ST13          |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
|   4  |   3310    |   GO BP     |   GO:0051084  |   HSPA6         |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
|   5  |   957     |   GO BP     |   GO:0051084  |   ENTPD5        |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
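
The edge conventions above can be sketched with two small helpers: one that builds the file name from the node types and the relation, and one that reports which required columns an edge record is missing. Both helper names are hypothetical and only illustrate the convention.

```python
REQUIRED_EDGE_COLUMNS = ("source", "relation", "target")

def edge_filename(source_type, relation, target_type):
    """Build an edge file name following <source>.<relation>.<target>.edges.csv."""
    return f"{source_type}.{relation}.{target_type}.edges.csv"

def missing_edge_columns(edge):
    """Return the required columns that are absent or empty in an edge record."""
    return [column for column in REQUIRED_EDGE_COLUMNS if edge.get(column) in (None, "")]

edge = {
    "source": "23753",
    "relation": "GO BP",
    "target": "GO:0051084",
    "source_label": "SDF2L1",  # optional metadata
}
print(edge_filename("Gene", "GO BP", "GO Biological Process Term"))
print(missing_edge_columns(edge))
```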

In [None]:
import requests
import json
import re
from tqdm import tqdm
import os
import pandas as pd
import time
import uuid

In [None]:
def fetch_save_read(url, file, reader=pd.read_csv, sep='\t', **kwargs):
  ''' Download file from {url}, save it to {file}, and subsequently read it with {reader} using pandas options on {**kwargs}.
  '''
  if not os.path.exists(file):
    if os.path.dirname(file):
      os.makedirs(os.path.dirname(file), exist_ok=True)
    df = reader(url, sep=sep, index_col=None)
    df.to_csv(file, sep=sep, index=False)
  return pd.read_csv(file, sep=sep, **kwargs)

In [None]:
organism = "Mammalia/Homo_sapiens"
url = 'ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/{}.gene_info.gz'.format(organism)
file = '{}.gene_info.tsv'.format(organism)

ncbi_gene = fetch_save_read(url, file)


In [None]:
def maybe_split(record):
    ''' NCBI Stores Nulls as '-' and lists '|' delimited
    '''
    if record in {'', '-'}:
        return set()
    return set(record.split('|'))

def supplement_dbXref_prefix_omitted(ids):
    ''' NCBI Stores external IDS with Foreign:ID while most datasets just use the ID
    '''
    for id in ids:
        # add original id
        yield id
        # also add id *without* prefix
        if ':' in id:
            yield id.split(':', maxsplit=1)[1]

In [None]:
ncbi_gene['All_synonyms'] = [
    set.union(
      maybe_split(gene_info['Symbol']),
      maybe_split(gene_info['Symbol_from_nomenclature_authority']),
      maybe_split(str(gene_info['GeneID'])),
      maybe_split(gene_info['Synonyms']),
      maybe_split(gene_info['Other_designations']),
      maybe_split(gene_info['LocusTag']),
      set(supplement_dbXref_prefix_omitted(maybe_split(gene_info['dbXrefs']))),
    )
    for _, gene_info in ncbi_gene.iterrows()
  ]

synonyms, gene_id = zip(*{
    (synonym, gene_info['GeneID'])
    for _, gene_info in ncbi_gene.iterrows()
    for synonym in gene_info['All_synonyms']
  })
ncbi_lookup_syn = pd.Series(gene_id, index=synonyms)
symbols, cap, gene_id = zip(*{
    (gene_info['Symbol'], gene_info['Symbol'].upper(), gene_info['GeneID'])
    for _, gene_info in ncbi_gene.iterrows()
  })
ncbi_lookup_sym = pd.Series(gene_id, index=symbols)
ncbi_lookup_sym_cap = pd.Series(gene_id, index=cap)

In [None]:
index_values = ncbi_lookup_syn.index.value_counts()
ambiguous = index_values[index_values > 1].index
ncbi_lookup_syn_disambiguated = ncbi_lookup_syn[(
(ncbi_lookup_syn.index == ncbi_lookup_syn) | (~ncbi_lookup_syn.index.isin(ambiguous))
)]
sym_dict = ncbi_lookup_sym.to_dict()
syn_dict_cap = ncbi_lookup_sym_cap.to_dict()
syn_dict = ncbi_lookup_syn_disambiguated.to_dict()
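def gene_lookup(gene):
    ''' Resolve a gene name to an Entrez ID string: exact symbol first,
    then upper-cased symbol, then unambiguous synonym ('None' if nothing matches).
    '''
    gene_id = sym_dict.get(gene)
    if gene_id: return str(gene_id)
    # keys of syn_dict_cap are upper-cased symbols, so upper-case the query too
    gene_id = syn_dict_cap.get(gene.upper())
    if gene_id: return str(gene_id)
    return str(syn_dict.get(gene))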
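
To make the lookup order easier to follow, here is a simplified sketch on toy data using plain dictionaries instead of pandas. The gene names and IDs are made up, `toy_lookup` is a hypothetical stand-in for `gene_lookup`, and upper-casing the query in the second step is an assumption about the intent of the upper-cased table.

```python
from collections import Counter

# Toy (symbol, synonyms, gene id) rows; made up for illustration only
toy_genes = [
    ("GENEA", {"ALPHA", "SHARED"}, 1),
    ("GENEB", {"BETA", "SHARED"}, 2),
]

symbol_to_id = {symbol: gene_id for symbol, _, gene_id in toy_genes}

# Count how many genes claim each synonym and keep only the unambiguous ones
synonym_counts = Counter(s for _, synonyms, _ in toy_genes for s in synonyms)
synonym_to_id = {
    synonym: gene_id
    for _, synonyms, gene_id in toy_genes
    for synonym in synonyms
    if synonym_counts[synonym] == 1
}

def toy_lookup(name):
    """Try exact symbol, then upper-cased symbol, then unambiguous synonym."""
    attempts = (
        (symbol_to_id, name),
        (symbol_to_id, name.upper()),
        (synonym_to_id, name),
    )
    for table, key in attempts:
        if key in table:
            return str(table[key])
    return None

print(toy_lookup("genea"), toy_lookup("BETA"), toy_lookup("SHARED"))
```

The synonym `SHARED` belongs to two genes, so it is dropped from the synonym table and resolves to nothing rather than to an arbitrary gene.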

In [None]:
gene_lookup("H4-16")

In [None]:
gene_lookup("STAT3")

In [None]:
all_genes = {}
gene_ids = set()
def get_gene_meta(gene):
    if gene in all_genes:
        return all_genes[gene]
    else:
        gene_id = gene_lookup(gene)
        # skip genes that could not be resolved or whose ID is already
        # registered under another name
        if gene_id in gene_ids or gene_id in (None, 'None'):
            return None
        else:
            gene_ids.add(gene_id)
            all_genes[gene] = {
                "id": gene_id,
                "label": gene,
                "uri": "https://www.ncbi.nlm.nih.gov/gene/%s"%gene_id
            }
            return all_genes[gene]

get_gene_meta("COPB2")

In [None]:
get_gene_meta('STAT3')

### Downloading the GMT files from Enrichr
The following code downloads the GMT file from Enrichr. This function checks the existence of the file locally before downloading it.

In [None]:
def fetch_and_save_library(library, file):
  ''' Download the Enrichr library {library} as GMT text, save it to {file} unless it already exists, and return its lines as a list.
  '''
  if not os.path.exists(file):
    if os.path.dirname(file):
      os.makedirs(os.path.dirname(file), exist_ok=True)
    gmt_url = "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=%s"%library
    res = requests.get(gmt_url)
    gmt = res.text
    with open(file, 'w') as o:
        o.write(gmt)
  
  with open(file) as o:
    return o.read().strip().split("\n")
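
The check-then-download pattern used above can be tried without network access. The sketch below is an illustration only: `cached_lines` is a hypothetical helper that takes the download step as a function, and `fake_download` stands in for the Enrichr request.

```python
import os
import tempfile

def cached_lines(path, download):
    """Return the lines of `path`, calling `download()` only if the file is missing."""
    if not os.path.exists(path):
        directory = os.path.dirname(path)
        if directory:
            os.makedirs(directory, exist_ok=True)
        with open(path, "w") as handle:
            handle.write(download())
    with open(path) as handle:
        return handle.read().strip().split("\n")

calls = []

def fake_download():
    # Stand-in for the HTTP request; records how often it is invoked
    calls.append(1)
    return "Term 1\t\tGene 1\tGene 2\n"

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "gmt", "toy.gmt")
    first = cached_lines(target, fake_download)
    second = cached_lines(target, fake_download)  # served from the cached file

print(first, len(calls))
```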

### Serializing GMT Files
Now that we have everything we need, let's convert some Enrichr libraries to CSV files. Gene metadata is collected in the `all_genes` dictionary defined above; we'll use it to generate a combined gene node CSV file later.

####  GO_Biological_Process_2021
The following code block downloads the GMT file if it does not already exist locally.

In [None]:
library = "GO_Biological_Process_2021"
filename = "gmt/%s.gmt"%library
gmt = fetch_and_save_library(library, filename)
print(gmt[0])

![GO](./img/graph.png)

##### Serializing the nodes

In [None]:
def gene_set_name_resolver(label):
    return {
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, label)),
        "label": label
    }

In [None]:
term = gmt[0].split("\t\t")[0]
term

In [None]:
gene_set_name_resolver(term)
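
The example node table at the top of this notebook uses the GO accession as the `id` and adds `ontology_label` and `uri` columns. A resolver producing that shape might look like the sketch below. It assumes labels end with the accession in parentheses, as in `... (GO:0051084)`, and falls back to a UUID otherwise; the name `go_term_resolver` is hypothetical.

```python
import re
import uuid

# A term name followed by a GO accession in parentheses at the end of the label
GO_LABEL_PATTERN = re.compile(r"^(?P<name>.+?)\s*\((?P<go_id>GO:\d{7})\)$")

def go_term_resolver(label):
    """Extract the GO accession from a label, or fall back to a deterministic UUID."""
    match = GO_LABEL_PATTERN.match(label.strip())
    if match is None:
        return {"id": str(uuid.uuid5(uuid.NAMESPACE_URL, label)), "label": label}
    go_id = match.group("go_id")
    return {
        "id": go_id,
        "label": label,
        "ontology_label": match.group("name"),
        "uri": "http://amigo.geneontology.org/amigo/term/%s" % go_id,
    }

resolved = go_term_resolver("2-oxoglutarate metabolic process (GO:0006103)")
fallback = go_term_resolver("a label without an accession")
print(resolved["id"], resolved["ontology_label"])
```

Because `uuid.uuid5` is derived from the label itself, the fallback ID stays the same across runs for the same label.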

##### Serializing the edges

Now that we have a way to resolve the term and gene nodes, let's serialize the edges. For a GMT file, an edge exists between a gene and a term if the gene is part of that term's gene set. For example, given the following:
```
Term 1      Gene 1  Gene 2  Gene 3
Term 2      Gene 2  Gene 4  Gene 5
```
we say that there is an edge between Term 1 and each of Gene 1, Gene 2, and Gene 3, and between Term 2 and each of Gene 2, Gene 4, and Gene 5.
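
The term-to-gene relationship can be sketched as a generator that turns GMT lines into `(term, gene)` pairs. It assumes the usual GMT layout of a term, a description field (empty in the toy lines here), and then the genes, all tab-separated; `gmt_to_pairs` is a hypothetical helper.

```python
def gmt_to_pairs(lines):
    """Yield (term, gene) pairs from tab-delimited GMT lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # need a term, a description, and at least one gene
        term, _description, *genes = fields
        for gene in genes:
            if gene:  # ignore empty trailing fields
                yield term, gene

toy_gmt = [
    "Term 1\t\tGene 1\tGene 2\tGene 3",
    "Term 2\t\tGene 2\tGene 4\tGene 5",
]
pairs = list(gmt_to_pairs(toy_gmt))
for term, gene in pairs:
    print(term, "->", gene)
```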

In [None]:
def iterate_gmt(gmt, term_node, relation, gene_set_name_resolver, resource=None, edge_properties={}):
    terms = []
    edges = []
    for line in tqdm(gmt):
        term, *genes = line.strip().split("\t")
        genes = genes[1:]
        term_meta = gene_set_name_resolver(term)
        if term_meta:
            term_id = term_meta["id"]
            terms.append(term_meta)
            for gene in genes:
                gene_meta = get_gene_meta(gene)
                if gene_meta:
                    if type(gene_meta) == str:
                        print(gene, gene_meta)
                    gene_id = gene_meta["id"]
                    edge = {
                        "source": term_id,
                        "relation": relation,
                        "target": gene_id,
                        "source_label": term,
                        "target_label": gene,
                        **edge_properties.get((term, gene), {})
                    }
                    if resource:
                        edge["resource"] = resource
                    edges.append(edge)
    # make sure the output directory exists before writing
    os.makedirs("csv", exist_ok=True)
    term_df = pd.DataFrame.from_records(terms)
    # keep the required id and label columns first
    cols = ["id", "label"] + [i for i in term_df.columns if i not in ["id", "label"]]
    term_df = term_df[cols]
    term_df.to_csv("csv/%s.nodes.csv"%term_node, index=False)
    edge_df = pd.DataFrame.from_records(edges)
    edge_df.to_csv("csv/%s.%s.Gene.edges.csv"%(term_node, relation), index=False)

In [None]:
term_node = "GO Biological Process Term"
relation = "GO BP"
resource = "http://geneontology.org/"
iterate_gmt(gmt, term_node, relation, gene_set_name_resolver, resource)

In [None]:
node_df = pd.read_csv('./csv/GO Biological Process Term.nodes.csv')
node_df.head()

In [None]:
edge_df = pd.read_csv('./csv/GO Biological Process Term.GO BP.Gene.edges.csv')
edge_df.head()

#### Exercise: MGI_Mammalian_Phenotype_Level_4_2021
Create a `gene_set_name_resolver` function for MGI_Mammalian_Phenotype_Level_4_2021

In [None]:
library = 'MGI_Mammalian_Phenotype_Level_4_2021'
filename = "gmt/%s.gmt"%library
gmt = fetch_and_save_library(library, filename)
print(gmt[0])

In [None]:
def gene_set_name_resolver(name):
    # TODO: return a dict with at least "id" and "label" keys for the term
    pass

In [None]:
term_node = "Mouse Phenotype"
relation = "MP"
resource = "http://www.informatics.jax.org"
iterate_gmt(gmt, term_node, relation, gene_set_name_resolver, resource)

## Finding Up- and Down-Regulated Genes from Drug Perturbations

In [None]:
df = pd.read_csv("https://minio.dev.maayanlab.cloud/kg-demo/lincs_consensus_drugs_5000.csv", index_col=0)
df.shape

In [None]:
df.iloc[0:5, 0:5]

## Getting the top 100 up-regulated genes:
1. For each drug, sort the genes by score in descending order.
2. Take the top 100 positive-scoring genes.
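
The two steps can be sketched on a toy score vector as follows. The explicit `scores > 0` filter is an assumption that takes "positive-scoring" literally, so a drug with fewer than `n` positive genes yields a shorter list; `top_up_regulated` is a hypothetical helper and the gene names are made up.

```python
import pandas as pd

def top_up_regulated(scores, n=100):
    """Return up to n genes with the highest strictly positive scores."""
    positive = scores[scores > 0]
    return positive.sort_values(ascending=False).head(n)

toy_scores = pd.Series({"GENE1": 2.5, "GENE2": -1.0, "GENE3": 0.7, "GENE4": 3.1})
top_two = top_up_regulated(toy_scores, n=2)
print(top_two)
```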

In [None]:
drug = "afatinib"
sorted = df[drug].sort_values(ascending=False)
sorted

In [None]:
## Top 100 up regulated genes for afatinib
sorted.head(100)

## Putting it all together

In [None]:
for k,v in sorted.head(100).items():
    print(k,v)

In [None]:
drugs = []
edges = []
for drug in df.columns:
    ## generate ids for drugs (stored as strings, like the term ids)
    drug_id = str(uuid.uuid5(uuid.NAMESPACE_URL, drug))
    drugs.append({
        "id": drug_id,
        "label": drug
    })
    sorted = df[drug].sort_values(ascending=False)
    up_regulated = sorted.head(100)
    for gene, score in up_regulated.items():
        gene_meta = get_gene_meta(gene)
        if gene_meta:
            if type(gene_meta) == str:
                print(gene, gene_meta)
            gene_id = gene_meta["id"]
            edge = {
                "source": drug_id,
                "relation": "LINCS Up Regulated",
                "target": gene_id,
                "source_label": drug,
                "target_label": gene,
                "score": score
            }
            edges.append(edge)


In [None]:
# make sure the output directory exists before writing
os.makedirs("csv", exist_ok=True)
drug_df = pd.DataFrame.from_records(drugs)
# keep the required id and label columns first
cols = ["id", "label"] + [i for i in drug_df.columns if i not in ["id", "label"]]
drug_df = drug_df[cols]
drug_df.to_csv("csv/Drug.nodes.csv", index=False)
edge_df = pd.DataFrame.from_records(edges)
edge_df.to_csv("csv/Drug.LINCS Up Regulated.Gene.edges.csv", index=False)

In [None]:
# Exercise: write the code for the down-regulated genes
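
One possible approach to this exercise, sketched on toy data: sort each drug's scores in ascending order and take the lowest `n`. The relation name `LINCS Down Regulated` and the helper `down_regulated_edges` are assumptions, and the `target` field holds the gene symbol as a placeholder; in the notebook it would be the ID returned by `get_gene_meta`.

```python
import uuid
import pandas as pd

def down_regulated_edges(scores_df, n=100, relation="LINCS Down Regulated"):
    """Build edge records for the n lowest-scoring genes of every drug column."""
    edges = []
    for drug in scores_df.columns:
        drug_id = str(uuid.uuid5(uuid.NAMESPACE_URL, drug))
        # ascending sort puts the most negative scores first
        lowest = scores_df[drug].sort_values(ascending=True).head(n)
        for gene, score in lowest.items():
            edges.append({
                "source": drug_id,
                "relation": relation,
                "target": gene,  # placeholder: resolve to a gene ID in practice
                "source_label": drug,
                "target_label": gene,
                "score": float(score),
            })
    return edges

toy = pd.DataFrame(
    {"drugA": [1.0, -2.0, -0.5], "drugB": [0.3, 0.1, -4.0]},
    index=["G1", "G2", "G3"],
)
toy_edges = down_regulated_edges(toy, n=1)
print([(e["source_label"], e["target_label"], e["score"]) for e in toy_edges])
```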

## Gene.csv
Using the `all_genes` dictionary, create a `Gene.nodes.csv` file.

In [None]:
# write your code here
genes = pd.DataFrame.from_records([i for i in all_genes.values() if i is not None])
genes.to_csv("csv/Gene.nodes.csv", index=False)

In [None]:
gene_df = pd.read_csv('csv/Gene.nodes.csv')
gene_df.head()

#### Ingestion
Ingestion is relatively simple if we followed the naming convention. `src/import_csv.py` is provided to do the heavy lifting. To run it, type the following on the command line:
```
python import_csv.py /path/to/csv/directory
```
This runs a bulk import of your CSV files. For example:
```
python import_csv.py ../notebooks/csv
```