# Introduction

This exercise uses the database you created in `Exercise02` and the BEL statement parsers you wrote with regular expressions in `Reading_searching_sending.ipynb`.

In [16]:
import pandas as pd
import os, json, re, time
time.asctime()

'Wed Oct  5 10:21:01 2016'

In [9]:
base = os.path.join(os.environ['BUG_FREE_EUREKA_BASE'])
base

'C:\\git\\bug-free-eureka'

# Task 1

This exercise is about loading the HGNC data to create a dictionary that maps each HGNC symbol to its set of enzyme IDs.

## 1.1 Load Data

Load json data from `/data/exercise02/hgnc_complete_set.json`.

In [17]:
data_path=os.path.join(base,'data','exercise02','hgnc_complete_set.json')
with open(data_path) as f:
    hgnc_json=json.load(f)
pd.DataFrame(hgnc_json).head()

Unnamed: 0,response,responseHeader
QTime,,16.0
docs,"[{'_version_': 1546503090507612160, 'ccds_id':...",
numFound,41049,
start,0,
status,,0.0


## 1.2 Reorganize Data into `pd.DataFrame`

Identify the relevant subdictionaries under `dictionary -> response -> docs`. Load them into a data frame,
then create a new data frame with just the HGNC symbol and enzyme ID columns.

In [18]:
docs = hgnc_json['response']['docs']
df_hgnc = pd.DataFrame(docs)
df_hgnc_sliced = df_hgnc[['symbol', 'enzyme_id']]

symbol2ec = {}
for idx, symbol, enzyme_ids in df_hgnc_sliced.itertuples():
    # enzyme_id is a list when the gene is annotated and NaN otherwise
    if isinstance(enzyme_ids, list):
        symbol2ec[symbol] = set(enzyme_ids)
    else:
        # an empty set keeps every value iterable in later lookups
        symbol2ec[symbol] = set()





## 1.3 Build dictionary for lookup

Iterate over this data frame to build a dictionary of the form `{hgnc symbol: set of enzyme IDs}`. Call this dictionary `symbol2ec`.

In [19]:
#symbol2ec = {}
'AKT1' in symbol2ec


len(symbol2ec)

41041
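
As a cross-check on the data frame route, the same lookup can be built straight from the `docs` list. The sketch below is only an illustration: `toy_json`, `toy_symbol2ec`, and the `XYZ1` record are made up for the example, and it assumes that records without an EC annotation simply lack the `enzyme_id` key.

```python
import json

# Toy payload mimicking the response -> docs layout. The records are
# illustrative and not copied from hgnc_complete_set.json.
toy_json = json.loads('''
{"response": {"numFound": 3, "docs": [
    {"symbol": "AKT1", "enzyme_id": ["2.7.11.1"]},
    {"symbol": "APP"},
    {"symbol": "XYZ1", "enzyme_id": ["1.1.1.1", "2.7.1.1"]}
]}}
''')

# Records without an enzyme_id key map to an empty set, so every value
# can be iterated without a None check.
toy_symbol2ec = {
    doc["symbol"]: set(doc.get("enzyme_id", []))
    for doc in toy_json["response"]["docs"]
}

print(toy_symbol2ec["AKT1"])
print(toy_symbol2ec["APP"])
```

Mapping unannotated symbols to an empty set rather than `None` is a design choice: membership tests such as `'APP' in toy_symbol2ec` still work, and loops over the EC codes need no special case.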

# Task 2

This subexercise is about validating protein and kinase activity statements in BEL. Refer to last Thursday's work in `Reading_searching_sending.ipynb`.

## 2.1 Valid HGNC

Write a function, `valid_hgnc(hgnc_symbol, symbol2ec_instance)`, that takes a name and the dictionary from Task 1.3 and returns whether the name is a valid HGNC symbol.

In [20]:
def valid_hgnc(hgnc_symbol, symbol2ec_instance):
    return hgnc_symbol in symbol2ec_instance


assert valid_hgnc('AKT1', symbol2ec)
assert not valid_hgnc('asasa', symbol2ec)

## 2.2 Valid Kinase Activity

Write a function, `valid_kinase(hgnc_symbol, symbol2ec_instance)` that takes a name and the dictionary from Task 1.3 and returns whether this protein has kinase activity. Hint: an enzyme code reference can be found [here](http://brenda-enzymes.org/ecexplorer.php?browser=1&f[nodes]=132&f[action]=open&f[change]=153)

In [21]:
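def valid_kinase(hgnc_symbol, symbol2ec_instance):
    # Unknown symbols cannot be kinases.
    if not valid_hgnc(hgnc_symbol, symbol2ec_instance):
        return False
    # Symbols without EC annotation may map to None or an empty set.
    for ec_id in symbol2ec_instance[hgnc_symbol] or ():
        # EC class 2.7: transferases of phosphorus-containing groups.
        if ec_id.startswith('2.7.'):
            print('ecID', ec_id)
            return True
    return False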
   

valid_kinase('AKT1',symbol2ec)


ecID 2.7.11.1


True
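
EC numbers are dot-separated fields that run from the general class to the specific enzyme, so a class test is really a comparison of leading fields. The helper below, `ec_in_class`, is a hypothetical name introduced for illustration. It is a minimal sketch that compares fields instead of raw string prefixes. Note that using class 2.7 as a stand-in for kinase activity is the approximation this exercise relies on; 2.7 also covers other phosphotransferases, such as the nucleotidyltransferases in 2.7.7.

```python
def ec_in_class(ec_id, ec_class):
    """Return True if ec_id falls under ec_class, comparing dotted fields."""
    id_fields = ec_id.split('.')
    class_fields = ec_class.split('.')
    # Compare only as many leading fields as the class specifies.
    return id_fields[:len(class_fields)] == class_fields


print(ec_in_class('2.7.11.1', '2.7'))     # True
print(ec_in_class('2.7.11.1', '2.7.11'))  # True
print(ec_in_class('3.1.3.16', '2.7'))     # False
```

Comparing fields avoids accidental matches that a bare `startswith('2.7')` would allow if a class such as `2.70` ever appeared in the data.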

In [22]:
# Raw strings keep the regex backslashes from being read as string escapes.
match_protein = re.compile(r'p\(HGNC:(?P<name>\w+)\)')
match_protein.match('p(HGNC:ABC)').groupdict()
match_kin = re.compile(r'kin\(p\(HGNC:(?P<name>\w+)\)\)')
match_kin.match('kin(p(HGNC:ABC))').groupdict()


{'name': 'ABC'}
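
One detail worth keeping in mind when validating whole terms: `Pattern.match` only anchors at the start of the string, while `Pattern.fullmatch` requires the entire string to fit. The short sketch below shows the difference on an illustrative malformed term; the test string is made up for the example.

```python
import re

match_protein = re.compile(r'p\(HGNC:(?P<name>\w+)\)')

term = 'p(HGNC:ABC) trailing junk'

# match() succeeds because only the beginning of the string is checked.
print(match_protein.match(term) is not None)      # True

# fullmatch() fails because the trailing text is not part of the pattern.
print(match_protein.fullmatch(term) is not None)  # False

# On a well-formed term, the named group gives the symbol directly.
print(match_protein.fullmatch('p(HGNC:ABC)').group('name'))  # ABC
```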

## 2.3 Putting it all together

Write a function, `validate_bel_term(term, symbol2ec_instance)` that parses a BEL term about either a protein, or the kinase activity of a protein and validates it.

```python
def validate_bel_term(term, symbol2ec_instance):
    pass
```

### Examples

```python
>>> # check that the proteins have valid HGNC codes
>>> validate_bel_term('p(HGNC:APP)', symbol2ec)
True
>>> validate_bel_term('p(HGNC:ABCDEF)', symbol2ec)
False
>>> # check that kinase activity annotations are only on proteins that are
>>> # actually protein kinases (hint: check EC annotation)
>>> validate_bel_term('kin(p(HGNC:APP))', symbol2ec)
False
>>> validate_bel_term('kin(p(HGNC:AKT1))', symbol2ec)
True
```

In [23]:
def validate_bel_term(term, symbol2ec_instance):
    # Raw strings keep the regex backslashes intact.
    match_protein = re.compile(r'p\(HGNC:(?P<name>\w+)\)')
    match_kin = re.compile(r'kin\(p\(HGNC:(?P<name>\w+)\)\)')

    kin = match_kin.fullmatch(term)
    if kin:
        # Kinase activity is only valid on proteins annotated as kinases.
        print('Protein', kin.group('name'))
        return valid_kinase(kin.group('name'), symbol2ec_instance)

    protein = match_protein.fullmatch(term)
    if protein:
        # A plain protein term only needs a valid HGNC symbol.
        print('Protein', protein.group('name'))
        return valid_hgnc(protein.group('name'), symbol2ec_instance)

    # Anything else is not a term this function understands.
    return False

print(validate_bel_term('kin(p(HGNC:AKT1))', symbol2ec))


Protein AKT1
ecID 2.7.11.1
True



# Task 3

This task is about manual curation of text. You will be guided through translating the following text into BEL statements, written as strings within a Python list.

## Document Definitions

Recall that citations are written with source, title, then identifier, as follows:

```
SET Citation = {"PubMed", "Nat Cell Biol 2007 Mar 9(3) 316-23", "17277771"}
```

Use these annotations and these namespaces:

```
DEFINE NAMESPACE HGNC AS URL "http://resource.belframework.org/belframework/20131211/namespace/hgnc-human-genes.belns"

DEFINE ANNOTATION CellLocation AS LIST {"cell nucleus", "cytoplasm", "endoplasmic reticulum"}
```


## Source Text

> The following statements are from the document "BEL Exercise" in edition 00001 of the PyBEL Journal.
> The kinase activity of A causes the increased abundance of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 in the cytoplasm,
> but only the increased expression of AKT serine/threonine kinase 1 in the endoplasmic reticulum.
> Additionally, the abundances of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 were found to be positively correlated in the cell nuclei.
> AKT serine/threonine kinase 2 increases GSK3 Beta in all of the nuclei, cytoplasm, and ER.

In [3]:
definition_statements = [
    'SET DOCUMENT name = "BEL Exercise"',
    'DEFINE NAMESPACE HGNC AS URL "http://resource.belframework.org/belframework/20131211/namespace/hgnc-human-genes.belns"',
    'DEFINE ANNOTATION CellLocation AS LIST {"cell nucleus", "cytoplasm", "endoplasmic reticulum"}',
]

In [4]:
# hint: there should be 11 statements from this text
your_statements = [
    ' A -> ',
    '',
    '',
    '',
    '',
    '',
    '',
    '',
    '',
    '',
    ''
]

In [5]:
statements = definition_statements + your_statements

# Task 4

This task is again about regular expressions. Return to `Reading_searching_sending.ipynb` and find your regular expressions that parse the subject, predicate, and object from a statement like `p(HGNC:AKT1) pos p(HGNC:AKT2)`
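
If your earlier expressions are not at hand, the following is a minimal sketch of one way to split such a statement. The names `statement_re` and `split_statement` are hypothetical, and the pattern assumes that the subject, predicate, and object contain no internal whitespace, which holds for the simple terms used in this exercise but not for BEL in general.

```python
import re

# Assumption: each of the three parts is a run of non-whitespace characters.
statement_re = re.compile(
    r'(?P<subject>\S+)\s+(?P<predicate>\S+)\s+(?P<object>\S+)'
)


def split_statement(statement):
    """Return a dict with subject, predicate, and object, or None."""
    m = statement_re.fullmatch(statement.strip())
    return m.groupdict() if m else None


print(split_statement('p(HGNC:AKT1) pos p(HGNC:AKT2)'))
print(split_statement(' A -> '))  # None: the object is missing
```

Returning `None` for statements that do not split cleanly lets the caller treat incomplete or empty strings as invalid without raising an exception.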

## 4.1 Validating Statements

Write a function, `validate_bel_statement(statement, symbol2ec)`, that takes a subject-predicate-object BEL statement as a string and determines whether its subject and object are valid.

In [6]:
def validate_bel_statement(statement, symbol2ec):
    pass

## 4.2 Validating Your Statements

Run this cell to validate the BEL statements you've written.

In [7]:
for statement in your_statements:
    valid = validate_bel_statement(statement, symbol2ec)
    print('{} is {}valid'.format(statement, '' if valid else 'in'))

NameError: name 'symbol2ec' is not defined

## 4.3 Visualization

Use `pybel` to visualize the network.

In [None]:
try:
    import pybel
    import networkx as nx
except ImportError:
    # Only a missing package should be reported as "not installed".
    print('PyBEL not installed')
else:
    g = pybel.from_bel(statements)
    nx.draw_spring(g, with_labels=True)