# Sequence to function

## The problem
The problem we are working on is called sequence to function. Ideally, this means that we start from a sequence and infer some kind of function from it.

### The sequence: 
To me, this could mean a few different things:
- Gene sequence mutations
- Gene and Protein Orthologs
- Post-translational modification

### The function: 
This could also mean a few things. My mind initially went to physiology, but I now realize that function could be a lot broader. For example, simple ligand-enzyme binding could be related to function, as could the pathway that the protein is involved in or its regulation.
- A change in ligand binding interactions (Initial GPCR activation)
- A change in metabolic secondary activity (GPCRs downstream)

## The solution
As a starting point for the hackathon problem, Ryan suggested building a full profile with only two data categories (for the sequence). To me this means:
- Gene name, mutation/ortholog
- Protein name, post-translational modification
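
The two categories above could be captured in a small data structure. The following is only a minimal sketch: the class names (`GeneEntry`, `ProteinEntry`, `SequenceProfile`), the `search_terms` helper, and the placeholder values `EXAMPLE_VARIANT` and `EXAMPLE_PTM` are all hypothetical and are not part of the project scripts.

```python
from dataclasses import dataclass, field


@dataclass
class GeneEntry:
    """Gene-level category: gene name plus mutations and/or orthologs."""
    gene_name: str
    mutations: list[str] = field(default_factory=list)
    orthologs: list[str] = field(default_factory=list)


@dataclass
class ProteinEntry:
    """Protein-level category: protein name plus post-translational modifications."""
    protein_name: str
    modifications: list[str] = field(default_factory=list)


@dataclass
class SequenceProfile:
    """The two data categories combined into one profile."""
    gene: GeneEntry
    protein: ProteinEntry

    def search_terms(self) -> list[str]:
        # Names first, then every annotation, in a stable order.
        terms = [self.gene.gene_name, self.protein.protein_name]
        terms += self.gene.mutations + self.gene.orthologs
        terms += self.protein.modifications
        return terms


# Placeholder annotations: these are NOT real variants or modifications.
profile = SequenceProfile(
    gene=GeneEntry("APOE", mutations=["EXAMPLE_VARIANT"]),
    protein=ProteinEntry("Apolipoprotein E", modifications=["EXAMPLE_PTM"]),
)
print(profile.search_terms())
```

Keeping gene-level and protein-level annotations in separate entries mirrors the two bullet points, so either side can be left empty while the other is filled in.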

Given those, could we pull out some kind of knowledge graph that will allow us to relate the profile to currently known research? This is where we will need the __agent__.
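
To make the knowledge-graph idea concrete, here is a minimal sketch of one possible representation: a set of (subject, relation, object) edges with a breadth-first lookup. The `KnowledgeGraph` class, the relation names (`encodes`, `cited_in`), and the node prefixes are assumptions chosen for illustration; the project may well use a dedicated graph library instead.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """Directed graph of labelled edges: subject -> {(relation, object)}."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[subject].add((relation, obj))

    def neighbors(self, subject, relation=None):
        # Objects directly linked to `subject`, optionally filtered by relation.
        return sorted(
            obj
            for rel, obj in self._edges.get(subject, ())
            if relation is None or rel == relation
        )

    def reachable(self, start):
        # Breadth-first search over outgoing edges, excluding the start node.
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for _, nxt in self._edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)
        return seen


kg = KnowledgeGraph()
kg.add("gene:APOE", "encodes", "protein:Apolipoprotein E")
kg.add("protein:Apolipoprotein E", "cited_in", "pmid:25173806")

print(kg.neighbors("gene:APOE", relation="encodes"))
print(sorted(kg.reachable("gene:APOE")))
```

With this shape, relating a profile to known research amounts to walking from a gene or protein node to the citation nodes it can reach.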

In [1]:
import sys, os
sys.path.append("..")

from scripts.fetch_data import fetch_uniprot_data, split_colon_list
from scripts.reference_scoring import (
    collect_reference_network_for_citations,
    score_reference_dataframe,
    select_top_scoring_articles,
)
from scripts.epmc_utils import fetch_epmc, save_dataframe_rows_as_json
import pandas as pd


In [10]:
# Create the JSON corpus directory if it doesn't exist.
# We are adding it to .gitignore; it may be too heavy to commit.
if not os.path.isdir("../data/corpus/"):
    os.makedirs("../data/corpus/", exist_ok=True)
    print("Making a new directory")
else:
    print("Directory already exists")

Making a new directory


In [7]:
# TODO: the parsing terms could go here.
uniprot_data = fetch_uniprot_data(["APOE", "CCR2"])
display(uniprot_data)
# Only the citation titles of the first row (APOE) are used from here on.
citation_list = split_colon_list(uniprot_data.citation_titles[0])

Unnamed: 0,gene_symbol,uniprot_id,protein_name,sequence,pmids,dois,citation_titles,reviewed
0,APOE,P02649,Apolipoprotein E,MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWEL...,,,"Synthesis, intracellular processing, and signa...",False
1,CCR2,P41597,C-C chemokine receptor type 2,MLSTSRSRFIRNTNESGEEVTTFFDYDYGAPCHKFDVKQIGAQLLP...,,,Molecular cloning and functional expression of...,False


In [None]:
# Query Europe PMC for each UniProt citation title.
uniprot_citation_records = [
    fetch_epmc(citation)
    for citation in citation_list
]
uniprot_citations_df = pd.DataFrame(uniprot_citation_records)

# Score each reference; this adds the *_signal and *_score columns shown in the output.
uniprot_citations_df = score_reference_dataframe(
    uniprot_citations_df,
    delay=0.1,
    include_fulltext=True,
)
display(uniprot_citations_df.head())

# Save the rows as JSON files keyed by PMID (EPMC_ prefix) in the corpus directory.
saved_uniprot_citations = save_dataframe_rows_as_json(
    uniprot_citations_df,
    "../data/corpus/",
    id_column="PMID",
    filename_prefix="EPMC_",
    indent=4,
    drop_missing=True,
)
print(f"Saved {len(saved_uniprot_citations)} UniProt citation JSON files.")

Unnamed: 0,PMC,DOI,PMID,PMCID,title,journal,year,source_url,source,abstract_text,full_text,full_text_abstract,function_signal,longevity_signal,year_score,functionality_score,longevity_score,composite_score
0,12294565.0,10.3390/ijms26146693,40724942,PMC12294565,Understanding the Insulin-Degrading Enzyme: A ...,,2025,https://europepmc.org/article/PMC/PMC12294565,MED,,pmc Int J Mol Sci Int J Mol Sci ijms Internati...,Insulin-degrading enzyme (IDE) plays a critica...,0.0,0.0,1.0,0.0,0.0,0.4
1,,10.1007/s00109-018-1632-y,29516132,,"Cell-specific production, secretion, and funct...",,2018,https://europepmc.org/article/MED/29516132,MED,,,Apolipoprotein E (apoE) is a 34-kDa glycoprote...,1.0,0.0,0.125,0.761594,0.0,0.316558
2,5296245.0,10.1002/jnr.23823,27531392,PMC5296245,Restoring Soluble Amyloid Precursor Protein α ...,,2017,https://europepmc.org/article/PMC/PMC5296245,MED,,,"Soluble amyloid precursor protein α (sAPPα), a...",1.0,0.0,0.111111,0.761594,0.0,0.311002
3,4253862.0,10.1016/j.nbd.2014.08.025,25173806,PMC4253862,Apolipoprotein E: structure and function in li...,,2014,https://europepmc.org/article/PMC/PMC4253862,MED,,,Apolipoprotein (apo) E is a multifunctional pr...,1.0,0.0,0.083333,0.761594,0.0,0.299891
4,288296.0,10.1172/jci116728,8376602,PMC288296,Type III hyperlipoproteinemic phenotype in tra...,,1993,https://europepmc.org/article/PMC/PMC288296,MED,,,Transgenic mice were prepared that expressed a...,1.0,0.0,0.030303,0.761594,0.0,0.278679


Saved 91 UniProt citation JSON files.
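
The score columns in the output above follow a pattern that can be reproduced by hand. The sketch below is inferred only from the five displayed rows and is not taken from `scripts.reference_scoring`: the year score appears to be `1 / (2025 - year + 1)`, the functionality score appears to be `tanh(function_signal)`, and the composite appears to weight them 0.40 and 0.35. The reference year of 2025 and the 0.25 longevity weight are assumptions; the longevity weight in particular cannot be checked here because every displayed longevity signal is 0.

```python
import math

# Assumption: the reference year is the newest publication year in the displayed rows.
REFERENCE_YEAR = 2025
# 0.40 and 0.35 match the displayed rows; 0.25 is a guess that makes the weights sum to 1.
WEIGHTS = {"year": 0.40, "functionality": 0.35, "longevity": 0.25}


def year_score(year, reference_year=REFERENCE_YEAR):
    # Newer articles score higher; an article from the reference year scores 1.0.
    age = max(reference_year - year, 0)
    return 1.0 / (age + 1)


def signal_score(signal):
    # tanh squashes a non-negative raw signal into [0, 1).
    return math.tanh(signal)


def composite_score(year, function_signal, longevity_signal):
    return (
        WEIGHTS["year"] * year_score(year)
        + WEIGHTS["functionality"] * signal_score(function_signal)
        + WEIGHTS["longevity"] * signal_score(longevity_signal)
    )


# (year, function_signal, longevity_signal) for the five displayed rows.
rows = [(2025, 0.0, 0.0), (2018, 1.0, 0.0), (2017, 1.0, 0.0), (2014, 1.0, 0.0), (1993, 1.0, 0.0)]
scores = [round(composite_score(*row), 6) for row in rows]
print(scores)
```

Under these weights a very recent article with no function signal (the 2025 row) still outranks older articles that do carry one, which may be worth revisiting when selecting the top-scoring articles.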
