# Introduction

This exercise makes use of the database you created in `Exercise02` and the BEL statement parsers you wrote with regular expressions in `Reading_searching_sending.ipynb`.

In [1]:
import pandas as pd
import os, json, re, time
time.asctime()

'Wed Oct  5 02:47:10 2016'

In [2]:
base = os.path.join(os.environ['BUG_FREE_EUREKA_BASE'])
base

'C:\\Users\\Tamal\\Documents\\GitHub\\bug-free-eureka'

# Task 1

This exercise is about loading the HGNC data to create a dictionary from HGNC symbols to sets of enzyme IDs.

## 1.1 Load Data

Load json data from `/data/exercise02/hgnc_complete_set.json`.

In [3]:
data_path = os.path.join(base, 'data', 'exercise02', 'hgnc_complete_set.json')
with open(data_path) as f:
    hgnc_json = json.load(f)
    

## 1.2 Reorganize Data into `pd.DataFrame`

Identify the relevant subdictionaries under `response -> docs` in the loaded JSON. Load them into a data frame,
then create a new data frame with just the HGNC symbol and enzyme ID.

In [4]:
docs = hgnc_json['response']['docs']
df_hgnc = pd.DataFrame(docs)
list(df_hgnc.columns)

['_version_',
 'alias_name',
 'alias_symbol',
 'bioparadigms_slc',
 'ccds_id',
 'cd',
 'cosmic',
 'date_approved_reserved',
 'date_modified',
 'date_name_changed',
 'date_symbol_changed',
 'ena',
 'ensembl_gene_id',
 'entrez_id',
 'enzyme_id',
 'gene_family',
 'gene_family_id',
 'hgnc_id',
 'homeodb',
 'horde_id',
 'imgt',
 'intermediate_filament_db',
 'iuphar',
 'kznf_gene_catalog',
 'lncrnadb',
 'location',
 'location_sortable',
 'locus_group',
 'locus_type',
 'lsdb',
 'mamit-trnadb',
 'merops',
 'mgd_id',
 'mirbase',
 'name',
 'omim_id',
 'orphanet',
 'prev_name',
 'prev_symbol',
 'pseudogene.org',
 'pubmed_id',
 'refseq_accession',
 'rgd_id',
 'snornabase',
 'status',
 'symbol',
 'ucsc_id',
 'uniprot_ids',
 'uuid',
 'vega_id']
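
The column selection asked for above can be tried out on a toy frame first. The two records below are made up for illustration (they only mimic the shape of the entries under `response -> docs`, and `GENE1`/`GENE2` are not real symbols). The sketch shows that indexing with a list of column names returns a new two-column frame, and that records without an `enzyme_id` key end up as `NaN`.

```python
import pandas as pd

# Hypothetical records shaped like entries of hgnc_json['response']['docs']
toy_docs = [
    {'symbol': 'GENE1', 'hgnc_id': 'HGNC:0001', 'enzyme_id': ['2.7.11.1']},
    {'symbol': 'GENE2', 'hgnc_id': 'HGNC:0002'},  # no enzyme annotation
]

toy_df = pd.DataFrame(toy_docs)

# Indexing with a list of column names yields a new data frame
toy_sliced = toy_df[['symbol', 'enzyme_id']]

# Rows without an enzyme annotation hold NaN in the enzyme_id column
toy_annotated = toy_sliced.dropna(subset=['enzyme_id'])
print(toy_annotated)
```

Whether to drop the unannotated rows or keep them depends on what the lookup should say about non-enzymes; keeping them lets every symbol stay in the dictionary built in the next step.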

## 1.3 Build dictionary for lookup

Iterate over this data frame to build a dictionary of the form `{HGNC symbol: set of enzyme IDs}`. Call this dictionary `symbol2ec`.

In [5]:
df_hgnc[['hgnc_id', 'enzyme_id']].head(5)

Unnamed: 0,hgnc_id,enzyme_id
0,HGNC:5,
1,HGNC:37133,
2,HGNC:24086,
3,HGNC:7,
4,HGNC:27057,


In [6]:
symbol2ec = {}

df_hgnc_sliced = df_hgnc[['symbol', 'enzyme_id']]

In [7]:
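for idx, symbol, enzyme_ids in df_hgnc_sliced.itertuples():
    # Entries without an EC annotation are NaN (a float), so only lists are copied over;
    # storing an empty set otherwise keeps every value iterable
    if isinstance(enzyme_ids, list):
        symbol2ec[symbol] = set(enzyme_ids)
    else:
        symbol2ec[symbol] = set()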

In [8]:
for k in symbol2ec:
    print(k)

LINC00470
SNX12
RPL32P13
TRA-AGC16-1
IGHD4-11
OR7E14P
MIR4495
FGF13-AS1
RPL10AP5
INHBC
CYP2R1
KRT16P3
MIR3690
CKAP2
PDZD8
ASPSCR1
MSK15
FAM87B
LINC00862
RPL7AP7
LINC00266-4P
ZNF345
DNAL1
HNRNPA1P55
DCAF17
PRIM1
RNU6-430P
LPAR1
SERPINF2
DCAF12L1
BTNL9
PLEKHA2
MIR567
FRA18B
MRPS31P1
TTTY27P
MYADM
TRX-CAT1-3
MTCO1P22
C22orf34
C12orf71
VN2R18P
RPS24P17
PIRC39
NPY2R
SHFM1
OR2B8P
DENR
CLEC4G
UBE2Q2P9
OR8B2
NBL1
RNU6-722P
RNU6-473P
HSPE1P25
VSTM2B
TBC1D25
GAREM1
ITGAL
MTND4P37
ESRRG
WDR66
SH3RF1
RPL29P31
RSBN1
RNU1-117P
CSH2
CNTN4-AS2
EMX2
BBOX1-AS1
FXYD2
RN7SL744P
CNTNAP3P2
ZRSR1
CDHR3
COPS7B
TRNAVP2
KIAA1191
MRPS26
RNU6-584P
PDE4C
RPS14P5
LCMT1
SEC61B
RPL31P23
CRABP2
DPRXP6
SETX
MTHFS
MIR514B
RPS27AP6
ESRRB
MAPK7
CTAGE13P
RPL7AP53
RPL7AP9
TRG-GCC1-3
ANTXR2
BDKRB1
GSN-AS1
RN7SL231P
RNU7-200P
LINC01601
FNDC8
WHSC1
PPP4R3A
RPL36A
ARSJ
LIPT2
EML5
KRT19P1
MRT17
PPP1R3D
ZNF97
C8orf46
FAM216A
MIR3157
RN7SKP59
RNU2-34P
AKR1B10P1
MTCO3P30
MAN1A2
BRK1P2
RN7SL331P
OR2AP1
CICP14
SASH3
EPHB6
WDFY2
MRPL4

# Task 2

This subexercise is about validating protein and kinase activity statements in BEL. Refer to last Thursday's work in `Reading_searching_sending.ipynb`.

## 2.1 Valid HGNC

Write a function, `valid_hgnc(hgnc_symbol, symbol2ec_instance)`, that takes a name and the dictionary from Task 1.3 and returns whether the name is a valid HGNC symbol.

In [9]:
def valid_hgnc(hgnc_symbol, symbol2ec_instance):
    return hgnc_symbol in symbol2ec_instance

assert valid_hgnc('AKT1', symbol2ec)
# 'tamal' is not an HGNC symbol, so the check must fail for it
assert not valid_hgnc('tamal', symbol2ec)

## 2.2 Valid Kinase Activity

Write a function, `valid_kinase(hgnc_symbol, symbol2ec_instance)` that takes a name and the dictionary from Task 1.3 and returns whether this protein has kinase activity. Hint: an enzyme code reference can be found [here](http://brenda-enzymes.org/ecexplorer.php?browser=1&f[nodes]=132&f[action]=open&f[change]=153)
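
As a rough illustration of the hint: EC numbers are hierarchical, so a class membership test reduces to a prefix check. EC 2.7 is the class of transferases that move phosphorus-containing groups, and protein kinases sit below it (for example under 2.7.10, 2.7.11 and 2.7.12). The helper name `has_ec_prefix` and the toy dictionary below are assumptions made for this sketch, not part of the exercise.

```python
def has_ec_prefix(ec_ids, prefix):
    """Return True if any EC number in ec_ids lies under the given EC class prefix."""
    # Append a dot so that '2.7' does not accidentally match '2.70.x.x'
    prefix = prefix.rstrip('.') + '.'
    return any(ec_id.startswith(prefix) for ec_id in ec_ids)


# Toy annotations; the symbols are placeholders, not real HGNC entries
toy_symbol2ec = {
    'KINASE_LIKE': {'2.7.11.1'},
    'HYDROLASE_LIKE': {'3.4.21.5'},
    'NOT_AN_ENZYME': set(),
}

kinase_like = {s for s, ecs in toy_symbol2ec.items() if has_ec_prefix(ecs, '2.7')}
print(kinase_like)
```

Testing against `'2.7'` is deliberately broad, since it also admits kinases that phosphorylate lipids or small molecules; a narrower prefix such as `'2.7.11'` would restrict the check to protein-serine/threonine kinases.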

In [None]:
# These are kinases
symbol2ec['AKT1'], symbol2ec['PIK3CA'],symbol2ec['AKT2']

In [11]:
def valid_kinase(hgnc_symbol, symbol2ec_instance):
    if not valid_hgnc(hgnc_symbol, symbol2ec_instance):
        return False
    # The value is a collection of EC IDs, or empty/None if the gene has no annotation
    ec_ids = symbol2ec_instance[hgnc_symbol] or ()
    # EC class 2.7 covers transferases of phosphorus-containing groups, which includes kinases
    return any(ec_id.startswith('2.7.') for ec_id in ec_ids)

assert valid_kinase('AKT1', symbol2ec)
assert valid_kinase('AKT2', symbol2ec)

In [12]:
# Raw strings keep the backslash escapes intact for the regex engine
match_protein = re.compile(r'p\(HGNC:(?P<name>\w+)\)')
match_protein.match('p(HGNC:ABC)').groupdict()
match_kin = re.compile(r'kin\(p\(HGNC:(?P<name>\w+)\)\)')
match_kin.match('kin(p(HGNC:ABC))').groupdict()

{'name': 'ABC'}
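
One caveat with patterns like these: `match` only anchors at the start of the string, so trailing characters after a well-formed term go unnoticed. The short sketch below contrasts `match` with `fullmatch` on a deliberately malformed input; the input string is made up for illustration.

```python
import re

match_protein = re.compile(r'p\(HGNC:(?P<name>\w+)\)')

term = 'p(HGNC:ABC) trailing junk'

# match() succeeds because the pattern fits the beginning of the string
prefix_match = match_protein.match(term)

# fullmatch() requires the whole string to fit the pattern
full_match = match_protein.fullmatch(term)

print(prefix_match is not None, full_match is None)
```

For validation, where a term should be accepted only if it is exactly a protein or kinase-activity term, `fullmatch` (or an explicit `$` anchor) is usually the safer choice.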

## 2.3 Putting it all together

Write a function, `validate_bel_term(term, symbol2ec_instance)` that parses a BEL term about either a protein, or the kinase activity of a protein and validates it.

```python
def validate_bel_term(term, symbol2ec_instance):
    pass
```

### Examples

```python
>>> # check that the proteins have valid HGNC codes
>>> validate_bel_term('p(HGNC:APP)', symbol2ec)
True
>>> validate_bel_term('p(HGNC:ABCDEF)', symbol2ec)
False
>>> # check that kinase activity annotations are only on proteins that are
>>> # actually protein kinases (hint: check EC annotation)
>>> validate_bel_term('kin(p(HGNC:APP))', symbol2ec)
False
>>> validate_bel_term('kin(p(HGNC:AKT1))', symbol2ec)
True
```

In [13]:
def validate_bel_term(term, symbol2ec_instance):
    pass

In [15]:
def validate_bel_term(term, symbol2ec_instance):
    match_protein = re.compile(r'p\(HGNC:(?P<name>\w+)\)')
    match_kin = re.compile(r'kin\(p\(HGNC:(?P<name>\w+)\)\)')

    # Kinase activity terms need the EC check; plain proteins only need a valid symbol
    kin = match_kin.fullmatch(term)
    if kin:
        validator = valid_kinase
        name = kin.group('name')
    else:
        protein = match_protein.fullmatch(term)
        if not protein:
            return False  # neither p(...) nor kin(p(...))
        validator = valid_hgnc
        name = protein.group('name')

    print('Protein', name)
    return bool(validator(name, symbol2ec_instance))

print(validate_bel_term('kin(p(HGNC:AKT1))', symbol2ec))

Protein AKT1
True


# Task 3

This task is about manual curation of text. You will be guided through translating the following text into BEL statements as strings within a python list.

## Document Definitions

Recall citations are written with source, title, then identifier as follows:

```
SET Citation = {"PubMed", "Nat Cell Biol 2007 Mar 9(3) 316-23", "17277771"}
```
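
Since the statements in this task are written as Python strings, a citation line in this format can also be assembled programmatically. The helper name `format_citation` below is hypothetical and only illustrates the source, title, identifier ordering; note the doubled braces, which `str.format` needs in order to emit literal `{` and `}`.

```python
def format_citation(source, title, identifier):
    """Build a BEL citation line in source, title, identifier order."""
    # Doubled braces are escapes for literal braces in str.format
    return 'SET Citation = {{"{}", "{}", "{}"}}'.format(source, title, identifier)


citation = format_citation('PubMed', 'Nat Cell Biol 2007 Mar 9(3) 316-23', '17277771')
print(citation)
```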

Use these annotations and these namespaces:

```
DEFINE NAMESPACE HGNC AS URL "http://resource.belframework.org/belframework/20131211/namespace/hgnc-human-genes.belns"

DEFINE ANNOTATION CellLocation AS LIST {"cell nucleus", "cytoplasm", "endoplasmic reticulum"}
```


## Source Text

> The following statements are from the document "BEL Exercise" in edition 00001 of the PyBEL Journal.
> The kinase activity of A causes the increased abundance of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 in the cytoplasm, 
> but only the increased expression of AKT serine/threonine kinase 1 in the endoplasmic reticulum. 
> Additionally, the abundances of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 were found to be positively correlated in the cell nuclei.
> AKT serine/threonine kinase 2 increases GSK3 Beta in all of the nuclei, cytoplasm, and ER.

In [None]:
definition_statements = [
    'SET DOCUMENT name = "BEL Exercise"',
    'DEFINE NAMESPACE HGNC AS URL "http://resource.belframework.org/belframework/20131211/namespace/hgnc-human-genes.belns"',
    'DEFINE ANNOTATION CellLocation AS LIST {"cell nucleus", "cytoplasm", "endoplasmic reticulum"}',
]

In [None]:
# hint: there should be 11 statements from this text
your_statements = [
    '',
    '',
    '',
    '',
    '',
    '',
    '',
    '',
    '',
    '',
    ''
]

In [None]:
statements = definition_statements + your_statements

# Task 4

This task is again about regular expressions. Return to `Reading_searching_sending.ipynb` and find your regular expressions that parse the subject, predicate, and object from a statement like `p(HGNC:AKT1) pos p(HGNC:AKT2)`.
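
If the earlier notebook is not at hand, the following is one possible sketch of such a parser, not necessarily the one written there. The names `TERM`, `STATEMENT`, and `parse_statement` are assumptions for this example. It treats a term as either `kin(p(HGNC:X))` or `p(HGNC:X)` and accepts any whitespace-free token as the predicate, so it does not check that the predicate is a real BEL relation.

```python
import re

# A term is either kin(p(HGNC:X)) or p(HGNC:X); the longer form is tried first
TERM = r'(?:kin\(p\(HGNC:\w+\)\)|p\(HGNC:\w+\))'

# Subject, predicate, and object are separated by whitespace
STATEMENT = re.compile(
    r'^(?P<subject>' + TERM + r')\s+(?P<predicate>\S+)\s+(?P<object>' + TERM + r')$'
)


def parse_statement(statement):
    """Return a dict with subject, predicate, object, or None if parsing fails."""
    match = STATEMENT.match(statement.strip())
    return match.groupdict() if match else None


parsed = parse_statement('p(HGNC:AKT1) pos p(HGNC:AKT2)')
print(parsed)
```

Returning `None` for unparseable input keeps the caller's job simple: a statement is invalid either because it fails to parse or because one of its terms fails validation.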

## 4.1 Validating Statements

Write a function `validate_bel_statement(statement, symbol2ec)` that takes a subject, predicate, object BEL statement as a string and determines whether its subject and object are valid.

In [None]:
def validate_bel_statement(statement, symbol2ec):
    pass

## 4.2 Validating Your Statements

Run this cell to validate the BEL statements you've written.

In [None]:
for statement in your_statements:
    valid = validate_bel_statement(statement, symbol2ec)
    print('{} is {}valid'.format(statement, '' if valid else 'in'))

## 4.3 Visualization

Use `pybel` to visualize the network.

In [None]:
try:
    import pybel
    import networkx as nx  # aliased so that nx.draw_spring below resolves
except ImportError:
    print('PyBEL or networkx not installed')
else:
    g = pybel.from_bel(statements)
    nx.draw_spring(g, with_labels=True)