# Sequence to function

## The problem
The problem we are working on is called sequence to function: given a sequence, we want to infer some aspect of its function.

### The sequence: 
To me, "sequence" can mean a few different things:
- Gene sequence mutations
- Gene and Protein Orthologs
- Post-translational modification

### The function: 
This could also mean a few things. My mind initially went to physiology, but I now realize that function could be a lot broader. For example, simple ligand-enzyme binding could be related to function, as could the pathway the protein is involved in or its regulation. Two examples:
- A change in ligand binding interactions (initial GPCR activation)
- A change in secondary metabolic activity (downstream of GPCRs)

## The solution
When thinking about a solution to the hackathon problem, Ryan suggested that we start by building a full profile from only two data categories (for the sequence). To me this means:
- Gene name, mutation/ortholog
- Protein name, post-translational modification

Given those, could we pull out some kind of knowledge graph that will allow us to relate the profile to currently known research? This is where we will need the __agent__.
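
As a rough sketch of the two-category profile described above, the gene-level and protein-level information could be held in small record types. The class and field names here (`GeneRecord`, `ProteinRecord`, `SequenceProfile`, `mutations`, `orthologs`, `modifications`) are hypothetical and not part of the existing scripts; this only illustrates one possible shape for the data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneRecord:
    """Gene-level category: gene name plus known mutations/orthologs (hypothetical layout)."""
    name: str
    mutations: List[str] = field(default_factory=list)
    orthologs: List[str] = field(default_factory=list)


@dataclass
class ProteinRecord:
    """Protein-level category: protein name plus post-translational modifications."""
    name: str
    modifications: List[str] = field(default_factory=list)


@dataclass
class SequenceProfile:
    """The full profile built from the two data categories."""
    gene: GeneRecord
    protein: ProteinRecord


# Example using the first GenAge entry; the lists start empty and would be
# filled in as data is gathered.
profile = SequenceProfile(
    gene=GeneRecord(name="GHR"),
    protein=ProteinRecord(name="Growth hormone receptor"),
)
print(profile.gene.name, "->", profile.protein.name)
```

A flat structure like this could later become the nodes of the knowledge graph, with each list entry turning into an edge to the supporting literature.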

In [1]:
import sys, os
sys.path.append("..")

from scripts.fetch_data import fetch_uniprot_data, split_colon_list
from scripts.epmc_utils import fetch_epmc, save_dataframe_rows_as_json
import pandas as pd


In [2]:
# Create the JSON corpus directory if it doesn't exist.
# We are adding this directory to .gitignore, since it may be too heavy to commit.
if not os.path.isdir("../data/corpus/"):
    os.makedirs("../data/corpus/", exist_ok=True)
    print("Making a new directory")
else:
    print("Directory already exists")

Directory already exists


In [3]:
genage_human = pd.read_csv("../data/raw/genage_human.csv")
display(genage_human.head())
genes = genage_human['symbol'].values
print(len(genes))

Unnamed: 0,GenAge ID,symbol,name,entrez gene id,uniprot,why
0,1,GHR,growth hormone receptor,2690,GHR_HUMAN,mammal
1,2,GHRH,growth hormone releasing hormone,2691,SLIB_HUMAN,mammal
2,3,SHC1,SHC (Src homology 2 domain containing) transfo...,6464,SHC1_HUMAN,mammal
3,4,POU1F1,POU class 1 homeobox 1,5449,PIT1_HUMAN,mammal
4,5,PROP1,PROP paired-like homeobox 1,5626,PROP1_HUMAN,mammal


307


In [4]:
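# Possibly the place to add the parsing terms.
# Fetch UniProt records for the first nine GenAge genes.
uniprot_data = fetch_uniprot_data(genes[0:9])
display(uniprot_data)
# Split the citation titles of the first row (GHR) into a list.
citation_list = split_colon_list(uniprot_data.citation_titles[0])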

Fetching UniProt data for GHR
Fetching UniProt data for GHRH
Fetching UniProt data for SHC1
Fetching UniProt data for POU1F1
Fetching UniProt data for PROP1
Fetching UniProt data for TP53
Fetching UniProt data for TERC
✗ No UniProt result for TERC
Fetching UniProt data for TERT
Fetching UniProt data for ATM


Unnamed: 0,gene_symbol,uniprot_id,protein_name,sequence,pmids,dois,citation_titles,reviewed
0,GHR,P10912,Growth hormone receptor,MDLWQLLLTLALAGSSDAFSGSEATAAILSRAPWSLQSVNPGLKTN...,,,Growth hormone receptor and serum binding prot...,False
1,GHRH,P01286,Somatoliberin,MPLWVFFFVILTLSNSSHCSPPPPLTLRMRRYADAIFTNSYRKVLG...,,,Cloning and sequence analysis of cDNA for the ...,False
2,SHC1,P29353,SHC-transforming protein 1,MDLLPPKPKYNPLRNESLSSLEEGASGSTPPEELPSPSASSLGPIL...,,,A novel transforming protein (SHC) with an SH2...,False
3,POU1F1,P28069,Pituitary-specific positive transcription fact...,MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMS...,,,Cloning of the human cDNA for transcription fa...,False
4,PROP1,O75360,Homeobox protein prophet of Pit-1,MEAERRRQAEKPKKGRVGSNLLPERHPATGTPTTTVDSSAPPCRRL...,,,"Human Prop-1: cloning, mapping, genomic struct...",False
5,TP53,P04637,Cellular tumor antigen p53,MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLS...,,,Human p53 cellular tumor antigen: cDNA sequenc...,False
6,TERT,O14746,Telomerase reverse transcriptase,MPRAPRCRAVRSLLRSHYREVLPLATFVRRLGPQGWRLVQRGDPAA...,,,"hEST2, the putative human telomerase catalytic...",False
7,ATM,Q13315,Serine-protein kinase ATM,MSLVLNDLLICCRQLEHDRATERKKEVEKFKRLIRDPETIKHLDRH...,,,The complete sequence of the coding region of ...,False
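
Nine genes were requested but only eight rows came back: the log shows no UniProt result for TERC. TERC encodes the telomerase RNA component rather than a protein, which may explain the miss. A minimal sketch for listing such gaps is below; `missing_symbols` is a hypothetical helper, not part of `scripts.fetch_data`, and the two lists are copied from the log and table above.

```python
def missing_symbols(requested, returned):
    """Return the requested symbols that are absent from `returned`, in request order."""
    found = set(returned)
    return [symbol for symbol in requested if symbol not in found]


# Symbols requested in the fetch above, and those present in the result table.
requested = ["GHR", "GHRH", "SHC1", "POU1F1", "PROP1", "TP53", "TERC", "TERT", "ATM"]
returned = ["GHR", "GHRH", "SHC1", "POU1F1", "PROP1", "TP53", "TERT", "ATM"]

print(missing_symbols(requested, returned))
```

In the notebook itself the same call could presumably be written as `missing_symbols(genes[0:9], uniprot_data["gene_symbol"])`.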


In [None]:
# Look up each citation title in Europe PMC and collect the returned metadata records.
uniprot_citation_records = [
    fetch_epmc(citation)
    for citation in citation_list
]
uniprot_citations_df = pd.DataFrame(uniprot_citation_records)


Fetching EPMC metadata for: Growth hormone receptor and serum binding protein: purification, cloning and expression.
Metadata fetched: {'PMC': '1718137', 'DOI': '10.1136/adc.81.5.378', 'PMID': '10519707', 'PMCID': 'PMC1718137', 'title': 'Growth hormone insensitivity: a widening diagnosis.', 'journal': None, 'year': 1999, 'source_url': 'https://europepmc.org/article/PMC/PMC1718137', 'source': 'MED'}
Fetching EPMC metadata for: Characterization of the human growth hormone receptor gene and demonstration of a partial gene deletion in two patients with Laron-type dwarfism.
Metadata fetched: {'PMC': '298219', 'DOI': '10.1073/pnas.86.20.8083', 'PMID': '2813379', 'PMCID': 'PMC298219', 'title': 'Characterization of the human growth hormone receptor gene and demonstration of a partial gene deletion in two patients with Laron-type dwarfism.', 'journal': None, 'year': 1989, 'source_url': 'https://europepmc.org/article/PMC/PMC298219', 'source': 'MED', 'abstract_text': "Laron-type dwarfism is an au
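
In the log above, the first query returned a record whose title differs from the title that was searched for, so a title-based lookup may not always hit the intended paper. One possible safeguard, sketched here with the standard-library `difflib`, is to compare the two titles before accepting a record. `title_similarity` is a hypothetical helper, and any acceptance threshold would be an assumption to tune by hand.

```python
from difflib import SequenceMatcher


def title_similarity(query, fetched):
    """Similarity ratio in [0, 1] between two titles, ignoring case and a trailing period."""
    def normalize(title):
        return title.lower().strip().rstrip(".")

    return SequenceMatcher(None, normalize(query), normalize(fetched)).ratio()


# The first query/result pair from the log above.
query = "Growth hormone receptor and serum binding protein: purification, cloning and expression."
fetched = "Growth hormone insensitivity: a widening diagnosis."

score = title_similarity(query, fetched)
print(f"similarity: {score:.2f}")
```

Records scoring below a chosen cutoff could be flagged for manual review rather than saved to the corpus directly.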

In [None]:
# Save each citation row as its own JSON file in the corpus directory.
saved_uniprot_citations = save_dataframe_rows_as_json(
    uniprot_citations_df,
    "../data/corpus/",
    id_column="PMID",
    filename_prefix="EPMC_",
    indent=4,
    drop_missing=True,
)
print(f"Saved {len(saved_uniprot_citations)} UniProt citation JSON files.")