# Sequence to function

## The problem
The problem we are working on is called sequence to function: given a sequence, we want to infer some kind of function.

### The sequence: 
To me, the sequence can mean a few different things:
- Gene sequence mutations
- Gene and Protein Orthologs
- Post-translational modification

### The function: 
This could also mean a few things. My mind initially went to physiology, but I now realize function could be a lot broader. For example, simple ligand-enzyme binding could be related to function, as could the pathway the protein is involved in or its regulation.
- A change in ligand binding interactions (initial GPCR activation)
- A change in metabolic secondary activity (downstream of GPCRs)

## The solution
When thinking about a solution to the hackathon problem, Ryan suggested that we start by building a full profile with only two data categories (for the sequence). To me this means:
- Gene name, mutation/ortholog
- Protein name, post-translational modification
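
One way to picture this two-category profile is as a small pair of records. The sketch below is only an illustration: the class and field names (`GeneRecord`, `ProteinRecord`, `SequenceProfile`) are hypothetical and not part of the project code, and the example values are taken from the GenAge table used later in this notebook.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneRecord:
    """Gene-level category: gene name plus a mutation and/or ortholog (hypothetical schema)."""
    gene_name: str
    mutation: Optional[str] = None   # left empty until a real variant is supplied
    ortholog: Optional[str] = None   # left empty until a real ortholog is supplied


@dataclass
class ProteinRecord:
    """Protein-level category: protein name plus post-translational modifications."""
    protein_name: str
    ptms: List[str] = field(default_factory=list)


@dataclass
class SequenceProfile:
    """The full profile: one gene record and one protein record."""
    gene: GeneRecord
    protein: ProteinRecord


# Example using a gene/protein pair that appears in the GenAge data.
profile = SequenceProfile(
    gene=GeneRecord(gene_name="GHR"),
    protein=ProteinRecord(protein_name="Growth hormone receptor"),
)
print(profile)
```

Keeping the two categories as separate records would make it easy to fill in only the half of the profile that is known for a given query.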

Given those, could we pull out some kind of knowledge graph that would allow us to relate the profile to currently known research? This is where we will need the __agent__.

In [1]:
import sys, os
sys.path.append("..")

from scripts.fetch_data import split_colon_list
from scripts.epmc_utils import fetch_epmc, fetch_epmc_batch_save_json, save_dataframe_rows_as_json
import pandas as pd


In [2]:
# Create the JSON corpus directory if it doesn't exist.
# We are adding it to .gitignore, since it may be too heavy to commit.
if not os.path.isdir("../data/corpus/"):
    os.makedirs("../data/corpus/", exist_ok=True)
    print("Making a new directory")
else:
    print("Directory already exists")

Directory already exists


In [3]:
genage_human = pd.read_csv("../data/raw/genage_human.csv")
display(genage_human.head())
genes = genage_human['symbol'].values
print(len(genes))

Unnamed: 0,GenAge ID,symbol,name,entrez gene id,uniprot,why
0,1,GHR,growth hormone receptor,2690,GHR_HUMAN,mammal
1,2,GHRH,growth hormone releasing hormone,2691,SLIB_HUMAN,mammal
2,3,SHC1,SHC (Src homology 2 domain containing) transfo...,6464,SHC1_HUMAN,mammal
3,4,POU1F1,POU class 1 homeobox 1,5449,PIT1_HUMAN,mammal
4,5,PROP1,PROP paired-like homeobox 1,5626,PROP1_HUMAN,mammal


307


In [4]:
# Curate the UniProt data
# uniprot_data = fetch_uniprot_data(genes)
# uniprot_data.to_csv("../data/processed/uniprot_output.csv", index=False)

# OR just open the saved output
uniprot_data = pd.read_csv("../data/processed/uniprot_output.csv")
display(uniprot_data.head())

citation_list = [
    item
    for citation_title in uniprot_data.citation_titles
    for item in split_colon_list(citation_title)
]
len(citation_list)


Unnamed: 0,gene_symbol,uniprot_id,protein_name,sequence,pmids,dois,citation_titles,reviewed
0,GHR,P10912,Growth hormone receptor,MDLWQLLLTLALAGSSDAFSGSEATAAILSRAPWSLQSVNPGLKTN...,,,Growth hormone receptor and serum binding prot...,False
1,GHRH,P01286,Somatoliberin,MPLWVFFFVILTLSNSSHCSPPPPLTLRMRRYADAIFTNSYRKVLG...,,,Cloning and sequence analysis of cDNA for the ...,False
2,SHC1,P29353,SHC-transforming protein 1,MDLLPPKPKYNPLRNESLSSLEEGASGSTPPEELPSPSASSLGPIL...,,,A novel transforming protein (SHC) with an SH2...,False
3,POU1F1,P28069,Pituitary-specific positive transcription fact...,MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMS...,,,Cloning of the human cDNA for transcription fa...,False
4,PROP1,O75360,Homeobox protein prophet of Pit-1,MEAERRRQAEKPKKGRVGSNLLPERHPATGTPTTTVDSSAPPCRRL...,,,"Human Prop-1: cloning, mapping, genomic struct...",False


12387
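
With over twelve thousand titles, it may be worth checking for repeated or empty entries before querying Europe PMC. The sketch below is a minimal, assumption-laden example: `dedupe_titles` is a hypothetical helper, not part of `scripts.fetch_data` or `scripts.epmc_utils`, and whether `citation_list` actually contains duplicates has not been checked here.

```python
from typing import Iterable, List


def dedupe_titles(titles: Iterable[object]) -> List[str]:
    """Return unique, non-empty titles in first-seen order (hypothetical helper)."""
    seen = set()
    unique = []
    for title in titles:
        # Skip NaN/None or other non-string entries that pandas may produce.
        if not isinstance(title, str):
            continue
        cleaned = title.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


# Toy input, not real citation data.
example = ["Title A", "Title B", " Title A ", "", float("nan")]
print(dedupe_titles(example))
```

If this were applied, the batch fetch could be given `dedupe_titles(citation_list)` instead of `citation_list`, which would reduce the number of requests whenever titles repeat across genes.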

In [None]:
uniprot_citations_df = fetch_epmc_batch_save_json(
    items = citation_list,
    directory = "../data/corpus/",
    id_column="PMID",
    filename_prefix="EPMC_",
    indent=4,
    drop_missing=True,
)
print(f"Saved {len(uniprot_citations_df)} UniProt citation JSON files.")

Fetching EPMC metadata for item 1 of 12387: Growth hormone receptor and serum binding protein: purification, cloning and expression.
Skipping existing file: EPMC_10519707.json
Fetching EPMC metadata for item 2 of 12387: Characterization of the human growth hormone receptor gene and demonstration of a partial gene deletion in two patients with Laron-type dwarfism.
Skipping existing file: EPMC_2813379.json
Fetching EPMC metadata for item 3 of 12387: Expression of a human growth hormone (hGH) receptor isoform is predicted by tissue-specific alternative splicing of exon 3 of the hGH receptor gene transcript.
Skipping existing file: EPMC_1569971.json
Fetching EPMC metadata for item 4 of 12387: Alternatively spliced forms in the cytoplasmic domain of the human growth hormone (GH) receptor regulate its ability to generate a soluble GH-binding protein.
Skipping existing file: EPMC_8855247.json
Fetching EPMC metadata for item 5 of 12387: A membrane-fixed, truncated isoform of the human growth h