# Sequence to function

## The problem
The problem we are working on is called sequence to function: given a sequence, we want to infer some aspect of its function.

### The sequence: 
To me, "sequence" could mean a few different things:
- Gene sequence mutations
- Gene and Protein Orthologs
- Post-translational modification

### The function: 
This could also mean a few things. My mind initially went to physiology, but I now realize that function could be a lot broader. For example, simple ligand-enzyme binding could be related to function, as could the pathway the protein is involved in or its regulation.
- A change in ligand binding interactions (Initial GPCR activation)
- A change in metabolic secondary activity (GPCRs downstream)

## The solution
For a solution to the hackathon problem, Ryan has suggested that we start by building a full profile from only two data categories (for the sequence). To me this means:
- Gene name, mutation/ortholog
- Protein name, post-translational modification
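
The two data categories above could be captured in a small record type. The sketch below is only one possible shape: the class and field names (`SequenceProfile`, `gene_variants`, `ptms`, `is_minimal`) are hypothetical and not part of the project code. The example uses only the CCR2 gene and protein names that appear later in this notebook.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SequenceProfile:
    """Hypothetical container for the two sequence-side data categories."""

    gene_name: str
    protein_name: str
    # Gene-level changes: mutations or ortholog identifiers.
    gene_variants: List[str] = field(default_factory=list)
    # Protein-level changes: post-translational modifications.
    ptms: List[str] = field(default_factory=list)

    def is_minimal(self) -> bool:
        """True when only the names are known and no variants or PTMs are recorded."""
        return not self.gene_variants and not self.ptms


profile = SequenceProfile(
    gene_name="CCR2",
    protein_name="C-C chemokine receptor type 2",
)
```

Keeping the gene-level and protein-level fields separate mirrors the two bullets above, so either category can be filled in independently.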

Given those, could we pull out some kind of knowledge graph that lets us relate them to currently known research? This is where we will need the __agent__.
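
To make the knowledge-graph idea a bit more concrete, here is a minimal sketch that stores the graph as (subject, relation, object) triples and walks it from a gene to the papers that describe its protein. The `TripleStore` class, the relation names, and the `gene:` / `protein:` / `pmid:` prefixes are assumptions made for illustration. The identifiers themselves (CCR2, P41597, PMID 8146186) are taken from the outputs further down in this notebook.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

# A knowledge graph stored as (subject, relation, object) triples.
Triple = Tuple[str, str, str]


class TripleStore:
    """Minimal in-memory triple store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._by_subject: DefaultDict[str, List[Triple]] = defaultdict(list)

    def add(self, subject: str, relation: str, obj: str) -> None:
        """Record one directed edge from subject to obj."""
        self._by_subject[subject].append((subject, relation, obj))

    def neighbours(self, subject: str) -> Set[str]:
        """Objects reachable from `subject` in one hop."""
        return {obj for _, _, obj in self._by_subject.get(subject, [])}

    def reachable(self, start: str) -> Set[str]:
        """All nodes reachable from `start` by following edges forward."""
        seen: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in self.neighbours(node):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


graph = TripleStore()
graph.add("gene:CCR2", "encodes", "protein:P41597")
graph.add("protein:P41597", "described_in", "pmid:8146186")
```

An agent could then ask `graph.reachable("gene:CCR2")` to collect every protein and paper linked to the gene, and decide which of those to read next.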

In [1]:
import sys, os
sys.path.append("..")

from scripts.fetch_data import fetch_uniprot_data, split_colon_list
from scripts.reference_scoring import (
    collect_reference_network_for_citations,
    score_reference_dataframe,
    select_top_scoring_articles,
)
from scripts.epmc_utils import fetch_epmc, save_dataframe_rows_as_json
import pandas as pd


In [2]:
# Create the directory for the paper JSON files if it doesn't exist. We are adding it to .gitignore; maybe it is too heavy?
if not os.path.isdir("../data/papers"):
    os.makedirs("../data/papers", exist_ok=True)
    print("Making a new directory")
else:
    print("Directory already exists")

Directory already exists


In [3]:
# Maybe the parsing terms can go here
uniprot_data = fetch_uniprot_data(["CCR2"])
display(uniprot_data)
citation_list = split_colon_list(uniprot_data.citation_titles[0])

Unnamed: 0,gene_symbol,uniprot_id,protein_name,sequence,pmids,dois,citation_titles,reviewed
0,CCR2,P41597,C-C chemokine receptor type 2,MLSTSRSRFIRNTNESGEEVTTFFDYDYGAPCHKFDVKQIGAQLLP...,,,Molecular cloning and functional expression of...,False


In [4]:
uniprot_citation_records = [
    fetch_epmc(citation)
    for citation in citation_list
]
uniprot_citations_df = pd.DataFrame(uniprot_citation_records)

uniprot_citations_df = score_reference_dataframe(
    uniprot_citations_df,
    delay=0.1,
    include_fulltext=True,
)
display(uniprot_citations_df.head())

saved_uniprot_citations = save_dataframe_rows_as_json(
    uniprot_citations_df,
    "../data/papers/uniprot_citations",
    id_column="PMID",
    filename_prefix="uniprot_",
    indent=4,
    drop_missing=True,
)
print(f"Saved {len(saved_uniprot_citations)} UniProt citation JSON files.")

Unnamed: 0,PMC,DOI,PMID,PMCID,title,journal,year,source_url,source,abstract_text,full_text,full_text_abstract,function_signal,longevity_signal,year_score,functionality_score,longevity_score,composite_score
0,,10.1038/ni1222,15995708,,Pivotal function for cytoplasmic protein FROUN...,,2005,https://europepmc.org/article/MED/15995708,MED,,,Ligation of the chemokine receptor CCR2 on mon...,1.25,0,0.047619,0.848284,0.0,0.315947
1,43448.0,10.1073/pnas.91.7.2752,8146186,PMC43448,Molecular cloning and functional expression of...,,1994,https://europepmc.org/article/PMC/PMC43448,MED,,,Monocyte chemoattractant protein 1 (MCP-1) is ...,1.25,0,0.03125,0.848284,0.0,0.309399
2,,10.1006/bbrc.1994.2049,8048929,,cDNA cloning and functional expression of a hu...,,1994,https://europepmc.org/article/MED/8048929,MED,,,A novel human G-protein-coupled seven-transmem...,1.25,0,0.03125,0.848284,0.0,0.309399
3,11198729.0,10.1016/j.cell.2024.05.021,38776920,PMC11198729,Human inherited CCR2 deficiency underlies prog...,,2024,https://europepmc.org/article/PMC/PMC11198729,MED,,pmc 0413066 2830 Cell Cell Cell 0092-8674 1097...,,0.0,0,0.5,0.0,0.0,0.2
4,5509255.0,10.1371/journal.pone.0181027,28704482,PMC5509255,Proteomic identification of proteins different...,,2017,https://europepmc.org/article/PMC/PMC5509255,MED,,PLoS One PLoS ONE plos plosone PLoS ONE 1932-6...,Reverse transcriptase activity of telomerase a...,0.0,0,0.111111,0.0,0.0,0.044444


Saved 20 UniProt citation JSON files.
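
The score columns in the table above can be reproduced with a short formula. The sketch below is reverse-engineered from the five displayed rows only and is not taken from `scripts/reference_scoring.py`, so treat it as a hypothesis. The year score matches 1 / (age + 1) with a reference year of 2025, the functionality score matches tanh of the function signal, and the composite matches weights of 0.40 (year) and 0.35 (functionality). The longevity weight of 0.25 and the use of tanh for longevity are assumptions, because every longevity value shown is 0. All function and constant names here are hypothetical.

```python
import math

# Weights inferred from the displayed rows. The longevity weight cannot be
# checked because every longevity_score shown is 0.0, so 0.25 is an assumption.
YEAR_WEIGHT = 0.40
FUNCTION_WEIGHT = 0.35
LONGEVITY_WEIGHT = 0.25  # assumed


def year_score(year: int, reference_year: int = 2025) -> float:
    """Recency score consistent with the table: 1 / (age in years + 1)."""
    age = max(reference_year - year, 0)
    return 1.0 / (age + 1)


def squash(signal: float) -> float:
    """Map a raw signal onto [0, 1); tanh(1.25) rounds to 0.848284."""
    return math.tanh(max(signal, 0.0))


def composite_score(
    year: int,
    function_signal: float,
    longevity_signal: float = 0.0,
    reference_year: int = 2025,
) -> float:
    """Weighted sum of the three component scores."""
    return (
        YEAR_WEIGHT * year_score(year, reference_year)
        + FUNCTION_WEIGHT * squash(function_signal)
        + LONGEVITY_WEIGHT * squash(longevity_signal)  # assumed form
    )


# Rows 0 and 3 of the table above (year, function_signal).
row0 = round(composite_score(2005, 1.25), 6)
row3 = round(composite_score(2024, 0.0), 6)
```

With these assumptions, `row0` and `row3` agree with the `composite_score` values displayed for the 2005 and 2024 papers. One consequence is that an old paper with a strong function signal (row 0) outranks a recent paper with no function signal (row 3).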


In [5]:
reference_network = collect_reference_network_for_citations(
    uniprot_citations_df.iloc[0:3],
    include=("references", "citations"),
    max_depth=1,
    delay=0.1,
    top_n=2,
)
display(reference_network.head())

KeyboardInterrupt: 
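
The cell above was interrupted before it finished. To reason about how much work `max_depth` and `top_n` imply, here is a sketch of a depth-limited, breadth-first expansion over a toy in-memory graph. It assumes that `max_depth` limits the number of hops from each seed article and that `top_n` caps how many neighbours are followed per article; the real behaviour of `collect_reference_network_for_citations` may differ. The `expand_network` function, the toy graph, and its scores are all made up.

```python
from collections import deque
from typing import Dict, List, Tuple

# Toy data: article id -> list of (neighbour id, score). All values are made up.
TOY_GRAPH: Dict[str, List[Tuple[str, float]]] = {
    "A": [("B", 0.9), ("C", 0.5), ("D", 0.1)],
    "B": [("E", 0.7)],
    "C": [("F", 0.3)],
}


def expand_network(
    seeds: List[str],
    graph: Dict[str, List[Tuple[str, float]]],
    max_depth: int = 1,
    top_n: int = 2,
) -> List[Tuple[str, int]]:
    """Breadth-first expansion that follows only the top_n best-scoring neighbours.

    Returns (article id, depth) pairs in the order they were discovered.
    """
    visited = set(seeds)
    discovered: List[Tuple[str, int]] = []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the depth limit
        ranked = sorted(graph.get(node, []), key=lambda pair: pair[1], reverse=True)
        for neighbour, _score in ranked[:top_n]:
            if neighbour not in visited:
                visited.add(neighbour)
                discovered.append((neighbour, depth + 1))
                queue.append((neighbour, depth + 1))
    return discovered


one_hop = expand_network(["A"], TOY_GRAPH, max_depth=1, top_n=2)
two_hops = expand_network(["A"], TOY_GRAPH, max_depth=2, top_n=2)
```

Under these assumptions the number of articles fetched per seed is bounded by roughly `top_n + top_n**2 + ... + top_n**max_depth`, so keeping both values small is the main lever for runtime when each lookup is a network request.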

In [None]:
reference_rows_scored = score_reference_dataframe(
    reference_network,
    delay=0.1,
    include_fulltext=False,
)

In [None]:
reference_rows_scored["score"] = reference_rows_scored["composite_score"]
saved_reference_files = save_dataframe_rows_as_json(
    reference_rows_scored,
    "../data/papers/reference_list",
    id_column="PMID",
    filename_prefix="reference_",
    indent=4,
    drop_missing=True,
)
print(f"Saved {len(saved_reference_files)} reference JSON files.")
display(reference_rows_scored[["gene_symbol", "title", "year", "score"]].head())

In [None]:
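n_references = 10

# Assumption: `scored_network` is the scored reference table built above
# (`reference_rows_scored`); it is not defined anywhere else in this notebook.
scored_network = reference_rows_scored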

if scored_network.empty:
    top_references = scored_network
    print("No articles available for ranking or export.")
else:
    top_references = select_top_scoring_articles(
        scored_network,
        n_per_gene=n_references,
        relation_filter=("reference", "citation"),
    )
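    # NOTE: enrich_with_epmc_fulltext is not imported in the first cell; import it before running this cell.
    top_references = enrich_with_epmc_fulltext(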
        top_references,
        delay=0.1,
        include_xml=True,
    )
    top_references["score"] = top_references["composite_score"]
    preview = top_references.assign(
        abstract_chars=top_references["abstract_text"].fillna("").str.len(),
        plain_text_chars=top_references["plain_text"].fillna("").str.len(),
    )
    display(preview[[
        "gene_symbol",
        "relation_primary",
        "title",
        "year",
        "score",
        "abstract_chars",
        "plain_text_chars",
    ]])

    saved_files = save_dataframe_rows_as_json(
        top_references,
        "../data/papers",
        id_column="PMID",
        filename_prefix="PMID",
        indent=4,
        drop_missing=True,
    )
    print(f"Saved {len(saved_files)} top-scoring JSON files to ../data/papers")
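
The selection step in the cell above can be pictured as "filter by relation, group by gene, keep the best `n` per group". The sketch below shows that idea on plain dictionaries without pandas. It is an assumption about what `select_top_scoring_articles` does, not its actual implementation; `top_n_per_gene` and the toy rows are hypothetical, while the column names come from the cell above.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def top_n_per_gene(
    rows: List[Dict],
    n_per_gene: int,
    relation_filter: Sequence[str] = ("reference", "citation"),
) -> List[Dict]:
    """Keep the n highest-scoring rows per gene among the allowed relations."""
    grouped = defaultdict(list)
    for row in rows:
        if row["relation_primary"] in relation_filter:
            grouped[row["gene_symbol"]].append(row)

    selected: List[Dict] = []
    for gene in sorted(grouped):
        ranked = sorted(
            grouped[gene], key=lambda r: r["composite_score"], reverse=True
        )
        selected.extend(ranked[:n_per_gene])
    return selected


# Toy rows; titles and scores are made up for illustration.
toy_rows = [
    {"gene_symbol": "CCR2", "relation_primary": "reference", "title": "paper A", "composite_score": 0.2},
    {"gene_symbol": "CCR2", "relation_primary": "citation", "title": "paper B", "composite_score": 0.9},
    {"gene_symbol": "CCR2", "relation_primary": "seed", "title": "paper C", "composite_score": 1.0},
]
best = top_n_per_gene(toy_rows, n_per_gene=1)
```

Here `best` contains only "paper B": "paper C" has the highest score but is dropped by the relation filter, and "paper A" loses on score.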


In [None]:
# Now once I have built a huge