# Serializing GMT Files to CSV
In this notebook we'll convert GMT files from [Enrichr](https://maayanlab.cloud/Enrichr) into CSV files that can be ingested into the graph database. This involves several steps:
1. Mapping or generating ids for the terms
2. Mapping genes to their Entrez ID
3. Creating the CSV file

## CSV Serialization
Nodes and edges are serialized differently for our knowledge graph. 

### Node Serialization
Serialized nodes require two columns: `id` and `label`. Optionally, you can add more columns for additional metadata. To be compatible with the provided ingestion script, the CSV file should be named `<node_type>.nodes.csv`. For our GMT files this means we need two node files: (1) one for the term type and (2) one for genes.

|      |   id          |   label                                                                    |   ontology_label                                              |   uri                                                  |
|------|---------------|----------------------------------------------------------------------------|---------------------------------------------------------------|--------------------------------------------------------|
|   0  |   GO:0051084  |   'de novo' posttranslational protein folding (GO:0051084)                 |   'de novo' posttranslational protein folding                 |   http://amigo.geneontology.org/amigo/term/GO:0051084  |
|   1  |   GO:0006103  |   2-oxoglutarate metabolic process (GO:0006103)                            |   2-oxoglutarate metabolic process                            |   http://amigo.geneontology.org/amigo/term/GO:0006103  |
|   2  |   GO:0050428  |   3'-phosphoadenosine 5'-phosphosulfate biosynthetic process (GO:0050428)  |   3'-phosphoadenosine 5'-phosphosulfate biosynthetic process  |   http://amigo.geneontology.org/amigo/term/GO:0050428  |
|   3  |   GO:0050427  |   3'-phosphoadenosine 5'-phosphosulfate metabolic process (GO:0050427)     |   3'-phosphoadenosine 5'-phosphosulfate metabolic process     |   http://amigo.geneontology.org/amigo/term/GO:0050427  |
|   4  |   GO:0061158  |   3'-UTR-mediated mRNA destabilization (GO:0061158)                        |   3'-UTR-mediated mRNA destabilization                        |   http://amigo.geneontology.org/amigo/term/GO:0061158  |
|   5  |   GO:0070935  |   3'-UTR-mediated mRNA stabilization (GO:0070935)                          |   3'-UTR-mediated mRNA stabilization                          |   http://amigo.geneontology.org/amigo/term/GO:0070935  |
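
As a minimal sketch of this layout, the snippet below writes a node table with `id` and `label` first and any extra metadata columns after them. It uses the standard library's `csv` module rather than pandas so that it runs on its own; the `write_nodes` helper and the `records` list are illustrative names, not part of the ingestion script.

```python
import csv
import io

# Illustrative term record; only "id" and "label" are required.
records = [
    {"label": "2-oxoglutarate metabolic process (GO:0006103)",
     "id": "GO:0006103",
     "uri": "http://amigo.geneontology.org/amigo/term/GO:0006103"},
]

def write_nodes(records, handle):
    """Write node records as CSV with the id and label columns first."""
    extra = []
    for record in records:
        if "id" not in record or "label" not in record:
            raise ValueError("every node needs an id and a label")
        for key in record:
            if key not in ("id", "label") and key not in extra:
                extra.append(key)  # optional metadata columns, in first-seen order
    writer = csv.DictWriter(handle, fieldnames=["id", "label"] + extra,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)

buffer = io.StringIO()  # stands in for an open "<node_type>.nodes.csv" file
write_nodes(records, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example free of side effects; in practice the handle would be a file opened for writing.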

### Edge Serialization

Edges require three columns: (1) `source`, the id of the source node, (2) `relation`, and (3) `target`, the id of the target node. Any other columns are optional metadata. The CSV file should be named as follows: `<source_node_type>.<relation>.<target_node_type>.edges.csv`.

|      |   source  |   relation  |   target      |   source_label  |   target_label                                              |   resource       |   link_to_resource          |
|------|-----------|-------------|---------------|-----------------|-------------------------------------------------------------|------------------|-----------------------------|
|   0  |   23753   |   GO BP     |   GO:0051084  |   SDF2L1        |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
|   1  |   3313    |   GO BP     |   GO:0051084  |   HSPA9         |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
|   2  |   10576   |   GO BP     |   GO:0051084  |   CCT2          |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
|   3  |   6767    |   GO BP     |   GO:0051084  |   ST13          |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
|   4  |   3310    |   GO BP     |   GO:0051084  |   HSPA6         |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
|   5  |   957     |   GO BP     |   GO:0051084  |   ENTPD5        |   'de novo' posttranslational protein folding (GO:0051084)  |   Gene Ontology  |   http://geneontology.org/  |
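
The naming convention and the required columns can be checked with a couple of small helpers. This is only a sketch: `edge_filename` and `missing_edge_columns` are hypothetical names introduced here for illustration and are not provided by the ingestion script.

```python
REQUIRED_EDGE_COLUMNS = ("source", "relation", "target")

def edge_filename(source_node_type, relation, target_node_type):
    """Build an edge file name following the convention described above."""
    return "%s.%s.%s.edges.csv" % (source_node_type, relation, target_node_type)

def missing_edge_columns(columns):
    """Return the required edge columns that are absent from `columns`."""
    return [c for c in REQUIRED_EDGE_COLUMNS if c not in columns]

print(edge_filename("GO Biological Process Term", "GO BP", "Gene"))
print(missing_edge_columns(["source", "target", "resource"]))
```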

In [4]:
import requests
import json
import re
from tqdm import tqdm
import os
import pandas as pd
import time
import uuid

### Gene Mapper
To start our conversion, we need a way to map gene names to their respective gene ids. The following code downloads the metadata for Homo sapiens genes from NCBI Gene and creates a mapper that returns the gene id. It does this by mapping (1) gene symbols, (2) upper-case gene symbols, and (3) synonyms to ids. (2) addresses the fact that Enrichr gene names are all upper case. Ambiguous synonyms (i.e. names with multiple ids) are removed from the map. The function `get_gene_meta` extends this and returns a dictionary containing the gene id, label, and uri, which can be used for our serialization.
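
Before working with the full NCBI table, the lookup strategy can be illustrated on a toy example. The ids and names below are made up, and the logic is a simplified sketch of the idea (symbol first, then upper-case symbol, then unambiguous synonyms) rather than the exact code used in this notebook.

```python
from collections import defaultdict

# Toy records (made-up ids and names): (gene id, symbol, '|'-delimited synonyms)
toy_genes = [
    ("1", "Abc1", "ALPHA|SHARED"),
    ("2", "Xyz2", "BETA|SHARED"),
    ("3", "Lmn3", "-"),
]

symbol_to_id = {}
upper_symbol_to_id = {}
synonym_to_ids = defaultdict(set)
for gene_id, symbol, synonyms in toy_genes:
    symbol_to_id[symbol] = gene_id
    upper_symbol_to_id[symbol.upper()] = gene_id
    if synonyms != "-":  # '-' marks a missing value
        for synonym in synonyms.split("|"):
            synonym_to_ids[synonym].add(gene_id)

# Keep only synonyms that point to exactly one id
synonym_to_id = {
    name: next(iter(ids))
    for name, ids in synonym_to_ids.items()
    if len(ids) == 1
}

def toy_lookup(name):
    """Try the symbol, then the upper-case symbol, then unambiguous synonyms."""
    for table in (symbol_to_id, upper_symbol_to_id, synonym_to_id):
        if name in table:
            return table[name]
    return None

print(toy_lookup("ABC1"))    # resolved through the upper-case symbol
print(toy_lookup("ALPHA"))   # resolved through an unambiguous synonym
print(toy_lookup("SHARED"))  # ambiguous synonym, so no id is returned
```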

In [5]:
def fetch_save_read(url, file, reader=pd.read_csv, sep='\t', **kwargs):
  ''' Download file from {url}, save it to {file}, and subsequently read it with {reader} using pandas options on {**kwargs}.
  '''
  if not os.path.exists(file):
    if os.path.dirname(file):
      os.makedirs(os.path.dirname(file), exist_ok=True)
    df = reader(url, sep=sep, index_col=None)
    df.to_csv(file, sep=sep, index=False)
  return pd.read_csv(file, sep=sep, **kwargs)

In [6]:
organism = "Mammalia/Homo_sapiens"
url = 'ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/{}.gene_info.gz'.format(organism)
file = '{}.gene_info.tsv'.format(organism)

ncbi_gene = fetch_save_read(url, file)


In [7]:
def maybe_split(record):
    ''' NCBI Stores Nulls as '-' and lists '|' delimited
    '''
    if record in {'', '-'}:
        return set()
    return set(record.split('|'))

def supplement_dbXref_prefix_omitted(ids):
    ''' NCBI Stores external IDS with Foreign:ID while most datasets just use the ID
    '''
    for id in ids:
        # add original id
        yield id
        # also add id *without* prefix
        if ':' in id:
            yield id.split(':', maxsplit=1)[1]

In [8]:
ncbi_gene['All_synonyms'] = [
    set.union(
      maybe_split(gene_info['Symbol']),
      maybe_split(gene_info['Symbol_from_nomenclature_authority']),
      maybe_split(str(gene_info['GeneID'])),
      maybe_split(gene_info['Synonyms']),
      maybe_split(gene_info['Other_designations']),
      maybe_split(gene_info['LocusTag']),
      set(supplement_dbXref_prefix_omitted(maybe_split(gene_info['dbXrefs']))),
    )
    for _, gene_info in ncbi_gene.iterrows()
  ]

synonyms, gene_id = zip(*{
    (synonym, gene_info['GeneID'])
    for _, gene_info in ncbi_gene.iterrows()
    for synonym in gene_info['All_synonyms']
  })
ncbi_lookup_syn = pd.Series(gene_id, index=synonyms)
symbols, cap, gene_id = zip(*{
    (gene_info['Symbol'], gene_info['Symbol'].upper(), gene_info['GeneID'])
    for _, gene_info in ncbi_gene.iterrows()
  })
ncbi_lookup_sym = pd.Series(gene_id, index=symbols)
ncbi_lookup_sym_cap = pd.Series(gene_id, index=cap)

In [9]:
index_values = ncbi_lookup_syn.index.value_counts()
ambiguous = index_values[index_values > 1].index
ncbi_lookup_syn_disambiguated = ncbi_lookup_syn[(
(ncbi_lookup_syn.index == ncbi_lookup_syn) | (~ncbi_lookup_syn.index.isin(ambiguous))
)]
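# Build the lookup dictionaries once instead of on every call to gene_lookup
sym_to_id = ncbi_lookup_sym.to_dict()
sym_cap_to_id = ncbi_lookup_sym_cap.to_dict()
syn_to_id = ncbi_lookup_syn_disambiguated.to_dict()

def gene_lookup(gene):
    ''' Return the NCBI gene id of {gene} as a string ('None' if it cannot be resolved) '''
    gene_id = sym_to_id.get(gene)
    if gene_id: return str(gene_id)
    gene_id = sym_cap_to_id.get(gene)
    if gene_id: return str(gene_id)
    return str(syn_to_id.get(gene))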

In [10]:
gene_lookup("H4-16")

'None'

In [11]:
gene_lookup("STAT3")

'6774'

In [12]:
all_genes = {}
gene_ids = set()
def get_gene_meta(gene):
    if gene in all_genes:
        return all_genes[gene]
    else:
        gene_id = gene_lookup(gene)
        if gene_id in gene_ids:
            return None
        elif gene_id == 'None':
            return None
        elif gene_id == None:
            return None
        else:
            gene_ids.add(gene_id)
            all_genes[gene] = {
                "id": gene_id,
                "label": gene,
                "uri": "https://www.ncbi.nlm.nih.gov/gene/%s"%gene_id
            }
            return all_genes[gene]

get_gene_meta("COPB2")

{'id': '9276',
 'label': 'COPB2',
 'uri': 'https://www.ncbi.nlm.nih.gov/gene/9276'}

In [13]:
get_gene_meta('STAT3')

{'id': '6774',
 'label': 'STAT3',
 'uri': 'https://www.ncbi.nlm.nih.gov/gene/6774'}

### Downloading the GMT files from Enrichr
The following code downloads the GMT file from Enrichr. This function checks the existence of the file locally before downloading it.

In [15]:
def fetch_and_save_library(library, file):
  ''' Download the Enrichr gene set library {library} to {file} if it is not already there, then return its lines as a list of strings.
  '''
  if not os.path.exists(file):
    if os.path.dirname(file):
      os.makedirs(os.path.dirname(file), exist_ok=True)
    gmt_url = "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=%s"%library
    res = requests.get(gmt_url)
    gmt = res.text
    with open(file, 'w') as o:
        o.write(gmt)
  
  with open(file) as o:
    return o.read().strip().split("\n")

### Serializing GMT Files
Now that we have everything we need, let's try converting some Enrichr libraries to CSV files. The `all_genes` dictionary defined above stores the metadata of every gene we encounter. We'll use it to generate a combined gene node CSV file later.

#### Using regular expression to get the term id: GO_Biological_Process_2021
The following code block downloads the GMT file if it does not already exist locally.

In [16]:
library = "GO_Biological_Process_2021"
filename = "gmt/%s.gmt"%library
gmt = fetch_and_save_library(library, filename)
print(gmt[0])

'de novo' posttranslational protein folding (GO:0051084)		SDF2L1	HSPA9	CCT2	ST13	HSPA6	ENTPD5	HSPA5	PTGES3	HSPA1L	HSPA8	DNAJB13	HSPA2	DNAJB14	HSPE1	DNAJC18	GAK	DNAJC7	DNAJB12	HSPA1A	HSPA1B	ERO1A	SELENOF	HSPA14	HSPA13	DNAJB1	CHCHD4	BAG1	DNAJB5	DNAJB4	SDF2	UGGT1	


![GO](./img/graph.png)

##### Serializing the nodes
For this GMT file, notice that the label already contains a persistent id that we can use as a node id. We can extract it by utilizing regular expressions.

In [17]:
name = "'de novo' posttranslational protein folding (GO:0051084)"
regex = r"(?P<label>(?P<ontology_label>.+) \((?P<id>GO:.+)\))"  # raw string so that \( is not treated as a string escape
props = re.search(regex, name).groupdict()
print(json.dumps(props, indent=4))

{
    "label": "'de novo' posttranslational protein folding (GO:0051084)",
    "ontology_label": "'de novo' posttranslational protein folding",
    "id": "GO:0051084"
}


In [18]:
def gene_set_name_resolver(name):
    regex = r"(?P<label>(?P<ontology_label>.+) \((?P<id>GO:.+)\))"
    match = re.search(regex, name)
    if match is None:
        return None  # no GO id in the term name
    props = match.groupdict()
    props["uri"] = "http://amigo.geneontology.org/amigo/term/%s"%props["id"]
    return props

In [19]:
term = gmt[0].split("\t\t")[0]

In [20]:
gene_set_name_resolver(term)

{'label': "'de novo' posttranslational protein folding (GO:0051084)",
 'ontology_label': "'de novo' posttranslational protein folding",
 'id': 'GO:0051084',
 'uri': 'http://amigo.geneontology.org/amigo/term/GO:0051084'}

##### Serializing the edges

Now that we have a way to process the term and gene nodes, let's move on to serializing the edges. For a GMT file, we say that an edge exists between a gene and a term if the gene is part of that term's gene set. That is, if we have the following:
```
Term 1      Gene 1  Gene 2  Gene 3
Term 2      Gene 2  Gene 4  Gene 5
```
Then there is an edge between Term 1 and each of Gene 1, Gene 2, and Gene 3, and between Term 2 and each of Gene 2, Gene 4, and Gene 5.
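
The toy GMT above can be turned into (term, gene) pairs with a few lines of Python. This is a minimal sketch: it assumes each line holds the term, a description field (empty in the Enrichr lines printed in this notebook), and then the genes, all tab-separated. `gmt_to_pairs` is an illustrative name.

```python
# The two-line toy GMT from above, with an empty description field after the term
gmt_lines = [
    "Term 1\t\tGene 1\tGene 2\tGene 3",
    "Term 2\t\tGene 2\tGene 4\tGene 5",
]

def gmt_to_pairs(lines):
    """Yield one (term, gene) pair per gene listed in each GMT line."""
    for line in lines:
        term, _description, *genes = line.strip().split("\t")
        for gene in genes:
            if gene:  # ignore empty fields
                yield term, gene

pairs = list(gmt_to_pairs(gmt_lines))
print(len(pairs))
for term, gene in pairs:
    print(term, "->", gene)
```

Each pair corresponds to one row of the edge file. Gene 2 appears in two pairs because it belongs to both gene sets.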

In [None]:
def iterate_gmt(gmt, term_node, relation, gene_set_name_resolver, resource=None, edge_properties={}):
    terms = []
    edges = []
    for line in tqdm(gmt):
        term, *genes = line.strip().split("\t")
        genes = genes[1:]
        term_meta = gene_set_name_resolver(term)
        if term_meta:
            term_id = term_meta["id"]
            terms.append(term_meta)
            for gene in genes:
                gene_meta = get_gene_meta(gene)
                if gene_meta:
                    if type(gene_meta) == str:
                        print(gene, gene_meta)
                    gene_id = gene_meta["id"]
                    edge = {
                        "source": term_id,
                        "relation": relation,
                        "target": gene_id,
                        "source_label": term,
                        "target_label": gene,
                        **edge_properties.get((term, gene), {})
                    }
                    if resource:
                        edge["resource"] = resource
                    edges.append(edge)
    term_df = pd.DataFrame.from_records(terms)
    cols = ["id", "label"] + [i for i in term_df.columns if not i in ["id", "label"]]
    term_df = term_df[cols]
    os.makedirs("csv", exist_ok=True)  # make sure the output directory exists
    term_df.to_csv("csv/%s.nodes.csv"%term_node, index=False)
    edge_df = pd.DataFrame.from_records(edges)
    edge_df.to_csv("csv/%s.%s.Gene.edges.csv"%(term_node, relation), index=False)

In [None]:
term_node = "GO Biological Process Term"
relation = "GO BP"
resource = "http://geneontology.org/"
iterate_gmt(gmt, term_node, relation, gene_set_name_resolver, resource)

100%|██████████| 6036/6036 [10:22<00:00,  9.69it/s]  


In [21]:
node_df = pd.read_csv('./csv/GO Biological Process Term.nodes.csv')
node_df.head()

|    | id | label | ontology_label | uri |
|----|----|-------|----------------|-----|
| 0 | GO:0051084 | 'de novo' posttranslational protein folding (G... | 'de novo' posttranslational protein folding | http://amigo.geneontology.org/amigo/term/GO:00... |
| 1 | GO:0006103 | 2-oxoglutarate metabolic process (GO:0006103) | 2-oxoglutarate metabolic process | http://amigo.geneontology.org/amigo/term/GO:00... |
| 2 | GO:0050428 | 3'-phosphoadenosine 5'-phosphosulfate biosynth... | 3'-phosphoadenosine 5'-phosphosulfate biosynth... | http://amigo.geneontology.org/amigo/term/GO:00... |
| 3 | GO:0050427 | 3'-phosphoadenosine 5'-phosphosulfate metaboli... | 3'-phosphoadenosine 5'-phosphosulfate metaboli... | http://amigo.geneontology.org/amigo/term/GO:00... |
| 4 | GO:0061158 | 3'-UTR-mediated mRNA destabilization (GO:0061158) | 3'-UTR-mediated mRNA destabilization | http://amigo.geneontology.org/amigo/term/GO:00... |


In [22]:
edge_df = pd.read_csv('./csv/GO Biological Process Term.GO BP.Gene.edges.csv')
edge_df.head()

|    | source | relation | target | source_label | target_label | resource |
|----|--------|----------|--------|--------------|--------------|----------|
| 0 | GO:0051084 | GO BP | 23753 | 'de novo' posttranslational protein folding (G... | SDF2L1 | http://geneontology.org/ |
| 1 | GO:0051084 | GO BP | 3313 | 'de novo' posttranslational protein folding (G... | HSPA9 | http://geneontology.org/ |
| 2 | GO:0051084 | GO BP | 10576 | 'de novo' posttranslational protein folding (G... | CCT2 | http://geneontology.org/ |
| 3 | GO:0051084 | GO BP | 6767 | 'de novo' posttranslational protein folding (G... | ST13 | http://geneontology.org/ |
| 4 | GO:0051084 | GO BP | 3310 | 'de novo' posttranslational protein folding (G... | HSPA6 | http://geneontology.org/ |


#### Exercise: MGI_Mammalian_Phenotype_Level_4_2021
Create a `gene_set_name_resolver` function for MGI_Mammalian_Phenotype_Level_4_2021

In [23]:
library = 'MGI_Mammalian_Phenotype_Level_4_2021'
filename = "gmt/%s.gmt"%library
gmt = fetch_and_save_library(library, filename)
print(gmt[0])

abdominal situs ambiguus MP:0011250		CCDC39	DNAH5	INVS	DNAH11	DNAAF3	FOXH1	RPGRIP1L	DRC1	DNAI1	IFT27	


In [None]:
def gene_set_name_resolver(name):
    pass

In [None]:
term_node = "Mouse Phenotype"
relation = "MP"
resource = "http://www.informatics.jax.org"
iterate_gmt(gmt, term_node, relation, gene_set_name_resolver, resource)

#### Using APIs to get the node id: KEGG_2021_Human
If the id is not part of the term name, we can use an API to create a mapping between terms and ids. This example uses KEGG's REST API to download the pathways and their respective ids.

In [24]:
library = "KEGG_2021_Human"
gmt = fetch_and_save_library(library, "gmt/%s"%library)
print(gmt[0])

ABC transporters		ABCA2	ABCC4	ABCG8	ABCC5	ABCA3	ABCC2	ABCA1	ABCC3	ABCC8	ABCA6	ABCA7	ABCC9	ABCA4	ABCC6	ABCA5	TAP2	TAP1	ABCA8	ABCA9	ABCA10	ABCB10	ABCA12	ABCB11	ABCC10	ABCG1	ABCG4	ABCC1	ABCG5	ABCG2	CFTR	ABCB4	ABCD3	ABCB1	ABCD4	ABCB7	ABCB8	ABCB5	ABCB6	ABCB9	ABCA13	ABCC11	DEFB1	ABCC12	ABCD1	ABCD2	


In [25]:
kegg_pathways = {}
res = requests.get("https://rest.kegg.jp/list/pathway")
# Each line of the response holds a pathway id and its label, separated by a tab
for i in res.text.strip().split("\n"):
    kid, label = i.strip().split("\t")
    kegg_pathways[label] = kid

In [26]:
kegg_pathways["ABC transporters"]

'path:map02010'

In [27]:
def gene_set_name_resolver(name):
    kegg_id = kegg_pathways[name] if name in kegg_pathways else name
    props = {
        "id": kegg_id,
        "label": name
    }
    if name in kegg_pathways:
        props["uri"] = "https://www.genome.jp/entry/%s"%props["id"]
    return props 

In [None]:
term_node = "KEGG Pathway"
relation = "KEGG"
resource = "https://www.genome.jp/kegg/"
iterate_gmt(gmt, term_node, relation, gene_set_name_resolver, resource)

100%|██████████| 320/320 [00:39<00:00,  8.17it/s]


In [28]:
node_df = pd.read_csv('./csv/KEGG Pathway.nodes.csv')
node_df.head()

|    | id | label | uri |
|----|----|-------|-----|
| 0 | path:map02010 | ABC transporters | https://www.genome.jp/entry/path:map02010 |
| 1 | path:map04933 | AGE-RAGE signaling pathway in diabetic complic... | https://www.genome.jp/entry/path:map04933 |
| 2 | path:map04152 | AMPK signaling pathway | https://www.genome.jp/entry/path:map04152 |
| 3 | path:map05221 | Acute myeloid leukemia | https://www.genome.jp/entry/path:map05221 |
| 4 | path:map04520 | Adherens junction | https://www.genome.jp/entry/path:map04520 |


In [29]:
edge_df = pd.read_csv('./csv/KEGG Pathway.KEGG.Gene.edges.csv')
edge_df.head()

|    | source | relation | target | source_label | target_label | resource |
|----|--------|----------|--------|--------------|--------------|----------|
| 0 | path:map02010 | KEGG | 20 | ABC transporters | ABCA2 | https://www.genome.jp/kegg/ |
| 1 | path:map02010 | KEGG | 10257 | ABC transporters | ABCC4 | https://www.genome.jp/kegg/ |
| 2 | path:map02010 | KEGG | 64241 | ABC transporters | ABCG8 | https://www.genome.jp/kegg/ |
| 3 | path:map02010 | KEGG | 10057 | ABC transporters | ABCC5 | https://www.genome.jp/kegg/ |
| 4 | path:map02010 | KEGG | 21 | ABC transporters | ABCA3 | https://www.genome.jp/kegg/ |


#### Exercise: Cancer_Cell_Line_Encyclopedia
Use the [Cellosaurus API](https://api.cellosaurus.org/) to create a `gene_set_name_resolver` for Cancer_Cell_Line_Encyclopedia. You may have to handle redundant ids. One way to do this is to generate a uuid based on the label's name instead:
```
id = str(uuid.uuid5(uuid.NAMESPACE_URL, gene_set_name))
```
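
To see why a name-based uuid works as a node id, note that `uuid.uuid5` is deterministic: the same label always produces the same id, while different labels produce different ids. The `label_to_id` wrapper below is an illustrative name, not part of the notebook.

```python
import uuid

def label_to_id(gene_set_name):
    """Derive a stable id from the label using a name-based (version 5) UUID."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, gene_set_name))

first = label_to_id("NCIH810 LUNG")
second = label_to_id("NCIH810 LUNG")
other = label_to_id("MCF-10A")

print(first == second)  # same label, same id
print(first == other)   # different labels, different ids
```

Because the id depends only on the label, rerunning the notebook yields the same node ids, which keeps repeated ingestions consistent.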

### Cellosaurus example

In [31]:
res = requests.get("https://api.cellosaurus.org/search/cell-line?q=id:MCF-10A&start=0&rows=10&format=json&fields=id,ac")
res.json()

{'Cellosaurus': {'cell-line-list': [{'accession-list': [{'type': 'primary',
      'value': 'CVCL_0598'}],
    'name-list': [{'type': 'identifier', 'value': 'MCF-10A'}]},
   {'accession-list': [{'type': 'primary', 'value': 'CVCL_JM26'}],
    'name-list': [{'type': 'identifier', 'value': 'MCF-10A PTEN(-/-)'}]},
   {'accession-list': [{'type': 'primary', 'value': 'CVCL_RA88'}],
    'name-list': [{'type': 'identifier', 'value': 'MCF-10A shPARG'}]},
   {'accession-list': [{'type': 'primary', 'value': 'CVCL_JM25'}],
    'name-list': [{'type': 'identifier', 'value': 'MCF-10A TP53(-/-)'}]},
   {'accession-list': [{'type': 'primary', 'value': 'CVCL_6C54'}],
    'name-list': [{'type': 'identifier', 'value': 'MCF-10A-neo'}]},
   {'accession-list': [{'type': 'primary', 'value': 'CVCL_6C55'}],
    'name-list': [{'type': 'identifier', 'value': 'MCF-10A-neoN'}]},
   {'accession-list': [{'type': 'primary', 'value': 'CVCL_5554'}],
    'name-list': [{'type': 'identifier', 'value': 'MCF-10A-neoT'}]},
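
The search returns every cell line whose identifier matches the query, so a resolver needs to pick the entry with the exact name. The sketch below works on a trimmed, hard-coded copy of the response structure shown above, so it runs without network access. `exact_accession` is an illustrative helper, and real responses may contain fields that are not handled here.

```python
# A trimmed copy of the response structure shown above (two entries only)
response = {"Cellosaurus": {"cell-line-list": [
    {"accession-list": [{"type": "primary", "value": "CVCL_0598"}],
     "name-list": [{"type": "identifier", "value": "MCF-10A"}]},
    {"accession-list": [{"type": "primary", "value": "CVCL_JM26"}],
     "name-list": [{"type": "identifier", "value": "MCF-10A PTEN(-/-)"}]},
]}}

def exact_accession(response, name):
    """Return the primary accession of the entry whose identifier equals `name`."""
    for entry in response["Cellosaurus"]["cell-line-list"]:
        identifiers = [n["value"] for n in entry["name-list"]
                       if n["type"] == "identifier"]
        if name in identifiers:
            for accession in entry["accession-list"]:
                if accession["type"] == "primary":
                    return accession["value"]
    return None  # no entry with exactly this identifier

print(exact_accession(response, "MCF-10A"))
print(exact_accession(response, "MCF-7"))
```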
   

In [30]:
library = "Cancer_Cell_Line_Encyclopedia"
gmt = fetch_and_save_library(library, "gmt/%s"%library)
print(gmt[0])

NCIH810 LUNG		CHGA	RBFOX1	C22ORF42	CACNA2D1	FLJ33996	TM4SF5	TAAR1	VWA5B2	FBXW4P1	MAPK8IP2	GRIN3A	SPATA17	LOC100505760	RFX6	CD2BP2	PCDHB6	LOC100506851	PCDHB9	LOC100506974	ATF3	HCN4	C1ORF194	LOC285548	ADARB2	DLL1	DLL3	MTMR7	DLL4	ASCL1	LOC100507495	LOC100128840	LOC100506046	KIAA1324	C10ORF108	NKX2-2	BSN	HES6	CHST9	ST18	GADD45G	C9ORF66	LOC285556	NR0B2	KIAA1486	LOC100506082	TMEM198	CBFA2T2	RIIAD1	RIPPLY2	C7ORF41	FAM123C	KIAA1239	UBAC2-AS1	FAM135B	PRKAR1B	LOC100652825	ARHGAP19-SLIT1	PRR18	SCN2A	WFDC10A	DOC2A	ARX	GBA2	TMEM186	HTATSF1	PPP3R2	PPP4C	PKHD1	PCDHA2	SI	KCNMB2	PDK4	LOC147670	ZBED1	KAT8	FFAR2	PCDHA3	TRPM6	SOX5	RIMKLA	GABRA2	LRRC27	LOC100379224	ACE	CCDC110	LOC284648	KCNB2	PABPC5	GPBAR1	PAX4	IGLV7-43	A1CF	RHOB	POU6F2	CLDN11	MRAS	CALY	PAH	LOC100506777	FAM181B	CLDN18	RGS7BP	ANKS4B	RAPGEF4	HOXD9	PCSK2	GUCY2C	PCSK1	GABRB1	C2CD4A	PTPRN	HNF4G	HEPACAM2	LOC100129617	ZNF48	LRRC31	FAM149A	DHRSX	KIF1A	SNTG2	QSOX2	C16ORF59	SLC18A1	LIN52	C16ORF53	TMED6	NEUROD1	LOC100652951	TMOD2	TMEM176A	TMEM176B	ST

In [25]:
cell_lines = {}
def gene_set_name_resolver(name):
    pass

In [26]:
term_node = "CCLE Cell Line"
relation = "CCLE"
resource = "https://sites.broadinstitute.org/ccle"
iterate_gmt(gmt, term_node, relation, gene_set_name_resolver, resource)

## Gene.csv
Using the `all_genes` dictionary, create a `Gene.nodes.csv` file.

In [30]:
# write your code here
genes = pd.DataFrame.from_records([i for i in all_genes.values() if not i == None])
genes.to_csv("csv/Gene.nodes.csv", index=False)

In [32]:
gene_df = pd.read_csv('csv/Gene.nodes.csv')
gene_df.head()

|    | id | label | uri |
|----|----|-------|-----|
| 0 | 9276 | COPB2 | https://www.ncbi.nlm.nih.gov/gene/9276 |
| 1 | 23753 | SDF2L1 | https://www.ncbi.nlm.nih.gov/gene/23753 |
| 2 | 3313 | HSPA9 | https://www.ncbi.nlm.nih.gov/gene/3313 |
| 3 | 10576 | CCT2 | https://www.ncbi.nlm.nih.gov/gene/10576 |
| 4 | 6767 | ST13 | https://www.ncbi.nlm.nih.gov/gene/6767 |


#### Ingestion
Ingestion is relatively simple if we followed the naming convention. `src/import_csv.py` is provided to do the heavy lifting. To run it, type the following on the command line:
```
python import_csv.py /path/to/csv/directory
```
This will run a bulk import of your CSV files, e.g.:
```
python import_csv.py ../notebooks/csv
```