# HGNC Gene Families

**Author:** Charles Tapley Hoyt

**Estimated Runtime:** 15 Seconds

This notebook outlines the process for programmatically downloading the curated list of gene families from HGNC and exporting it as a BEL namespace and a BEL script.

In [1]:
import pandas as pd
import os
import time

import pybel_tools as pbt
from pybel.utils import ensure_quotes

In [2]:
time.asctime()

'Tue Apr  4 23:18:09 2017'

In [3]:
pbt.__version__

'0.1.6-dev'

In [4]:
pybel_resources_base = os.environ['PYBEL_RESOURCES_BASE']
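
The lookup above raises a `KeyError` if `PYBEL_RESOURCES_BASE` is not set. Below is a minimal sketch of a more forgiving lookup. The helper name `get_resources_base` and the fallback to the current working directory are assumptions made for illustration; they are not part of PyBEL or PyBEL Tools.

```python
import os


def get_resources_base(default=None):
    """Return the output directory named by PYBEL_RESOURCES_BASE.

    Hypothetical helper: falls back to ``default`` (or, if that is not
    given, the current working directory) when the variable is unset.
    """
    base = os.environ.get('PYBEL_RESOURCES_BASE')
    if base is None:
        base = default if default is not None else os.getcwd()
    return base
```

With this in place, `pybel_resources_base = get_resources_base()` would behave like the cell above whenever the variable is set.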

## Download

The data comes from the HGNC custom downloads page, which serves the gene family assignments as a single tab-separated (TSV) file.

In [5]:
HGNC_GENE_FAMILY_URL = 'http://www.genenames.org/cgi-bin/genefamilies/download-all/tsv'

In [6]:
df = pd.read_csv(HGNC_GENE_FAMILY_URL, sep='\t')
df.head()

| | HGNC ID | Approved Symbol | Approved Name | Status | Previous Symbols | Synonyms | Chromosome | Accession Numbers | RefSeq IDs | Gene Family Tag | Gene family description | Gene family ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 324 | AGPAT1 | 1-acylglycerol-3-phosphate O-acyltransferase 1 | Approved | | LPAAT-alpha | 6p21.32 | U56417 | NM_006411 | AGPAT | 1-acylglycerol-3-phosphate O-acyltransferases | 46 |
| 1 | 325 | AGPAT2 | 1-acylglycerol-3-phosphate O-acyltransferase 2 | Approved | BSCL | LPAAT-beta | 9q34.3 | AF000237 | NM_006412 | AGPAT | 1-acylglycerol-3-phosphate O-acyltransferases | 46 |
| 2 | 326 | AGPAT3 | 1-acylglycerol-3-phosphate O-acyltransferase 3 | Approved | | LPAAT-gamma | 21q22.3 | AF156774 | NM_020132 | AGPAT | 1-acylglycerol-3-phosphate O-acyltransferases | 46 |
| 3 | 20885 | AGPAT4 | 1-acylglycerol-3-phosphate O-acyltransferase 4 | Approved | | LPAAT-delta, dJ473J16.2 | 6q26 | AF156776 | NM_020133 | AGPAT | 1-acylglycerol-3-phosphate O-acyltransferases | 46 |
| 4 | 20886 | AGPAT5 | 1-acylglycerol-3-phosphate O-acyltransferase 5 | Approved | | FLJ11210, LPAAT-e, LPAAT-epsilon | 8p23.1 | AF375789 | NM_018361 | AGPAT | 1-acylglycerol-3-phosphate O-acyltransferases | 46 |


In [7]:
entries = set(df['Gene family description'].unique())

In [8]:
len(entries)

973
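
The de-duplication step can also be illustrated without a network connection or pandas. The sketch below parses a small inline TSV sample with the standard library and collects the distinct family descriptions, skipping blank values. The helper name `unique_families` and the `GENE_X`/`GENE_Y` rows are made up for this example and are not HGNC data.

```python
import csv
import io

# Tiny stand-in for the HGNC download; the last two rows are invented
# to show whitespace trimming and blank-value handling.
SAMPLE_TSV = (
    "Approved Symbol\tGene family description\n"
    "AGPAT1\t1-acylglycerol-3-phosphate O-acyltransferases\n"
    "AGPAT2\t1-acylglycerol-3-phosphate O-acyltransferases\n"
    "GENE_X\tExample family \n"
    "GENE_Y\t\n"
)


def unique_families(tsv_text, column='Gene family description'):
    """Return the set of non-empty, whitespace-trimmed values in ``column``."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
    families = set()
    for row in reader:
        name = (row.get(column) or '').strip()
        if name:
            families.add(name)
    return families


print(sorted(unique_families(SAMPLE_TSV)))
```

Trimming before adding to the set matters here: otherwise `'Example family '` and `'Example family'` would count as two different entries.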

## Export BELNS

PyBEL Tools provides a function that writes a namespace to a file-like stream, given the namespace metadata and its values.

In [9]:
with open(os.path.join(pybel_resources_base, 'gfam.belns'), 'w') as f:
    pbt.definition_utils.write_namespace(
        "HGNC Gene Families",
        "GFAM",
        "Gene and Gene Products",
        'Charles Tapley Hoyt',
        HGNC_GENE_FAMILY_URL,
        (entry.strip() for entry in entries),
        namespace_species='9606',
        namespace_description="HUGO Gene Nomenclature Committee (HGNC) curated gene families",
        author_copyright='WTF License',
        functions="GRP",
        author_contact="charles.hoyt@scai.fraunhofer.de",
        file=f
    )

## Export BEL

Each gene's membership in a family is exported as an `isA` statement in a BEL script.

In [10]:
HGNC_URL = 'https://owncloud.scai.fraunhofer.de/index.php/s/JsfpQvkdx3Y5EMx/download?path=hgnc-human-genes.belns'
GFAM_URL = 'https://owncloud.scai.fraunhofer.de/index.php/s/JsfpQvkdx3Y5EMx/download?path=gfam.belns'

In [11]:
namespace_dict = {
    'HGNC': HGNC_URL,
    'GFAM': GFAM_URL
}

In [12]:
with open(os.path.join(pybel_resources_base, 'gfam_members.bel'), 'w') as f:
    pbt.document_utils.write_boilerplate(
        document_name='Gene Family Definitions',
        authors='Charles Tapley Hoyt',
        contact='charles.hoyt@scai.fraunhofer.de',
        description='Encoding of the gene family memberships',
        namespace_dict=namespace_dict,
        annotations_dict={},
        file=f
    )
    
    print('SET Citation = {"PubMed","HGNC","25361968"}', file=f)
    print('SET Evidence = "HGNC Definitions"', file=f)
    
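    # One isA statement per row; itertuples() yields the index first, which is discarded
    for _, gfam, gene in df[['Gene family description', 'Approved Symbol']].itertuples():
        # Strip stray whitespace, then add quotes where the name needs them
        gfam_clean = ensure_quotes(gfam.strip())
        gene_clean = ensure_quotes(gene.strip())

        print('g(HGNC:{}) isA g(GFAM:{})'.format(gene_clean, gfam_clean), file=f)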
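
To make the shape of the generated statements concrete, here is a minimal sketch that does not depend on PyBEL. The function `quote_if_needed` is a stand-in written for this example, on the assumption that `ensure_quotes` wraps any name that is not purely alphanumeric in double quotes; check the PyBEL source for the actual behavior. `membership_statement` is likewise a hypothetical helper.

```python
def quote_if_needed(name):
    """Stand-in for pybel.utils.ensure_quotes (assumed behavior, not the real function)."""
    return name if name.isalnum() else '"{}"'.format(name)


def membership_statement(gene, family):
    """Format one gene-to-family isA statement in the same shape as the loop above."""
    return 'g(HGNC:{}) isA g(GFAM:{})'.format(
        quote_if_needed(gene.strip()),
        quote_if_needed(family.strip()),
    )


print(membership_statement('AGPAT1', '1-acylglycerol-3-phosphate O-acyltransferases'))
```

Family descriptions usually contain spaces or hyphens, so they end up quoted, while most gene symbols are plain alphanumeric tokens and are left bare.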