Scripts to reproduce results reported in FamPlex manuscript

This repository is organized into six folders, one for each stage of constructing and evaluating the FamPlex resource.

Step 1: Get list of genes and corresponding PMIDs

  • Download Reactome signaling proteins list (see step1_genes_pmids/signaling_proteins_readme.txt)
    • output: signaling_proteins.tsv
  • Run to get a gene list for literature search
    • For each protein, get gene name.
    • output: signaling_genes.txt
    • Get genes identified in famplex/relations.csv (which included families from BEL and from the Ras family).
    • Combine the two gene lists
    • output: combined_genes.txt
  • Run to get a list of PMIDs from gene list
    • For each gene in combined_genes.txt, get papers curated from Entrez gene.
    • Save dict mapping gene to list of papers
    • output: combined_genes_to_pmids.pkl
    • Make set of unique PMIDs, save
    • output: combined_pids.txt
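
The PMID bookkeeping in this step can be sketched roughly as follows. This is a minimal illustration and not the repository's actual script: the function names `combine_gene_lists` and `unique_pmids` and the example data are hypothetical, and the Entrez Gene lookup itself is omitted.

```python
def combine_gene_lists(signaling_genes, famplex_genes):
    """Merge two gene lists into one sorted list without duplicates."""
    return sorted(set(signaling_genes) | set(famplex_genes))


def unique_pmids(genes_to_pmids):
    """Collect the unique PMIDs from a dict mapping gene -> list of PMIDs."""
    pmids = set()
    for pmid_list in genes_to_pmids.values():
        pmids.update(pmid_list)
    return sorted(pmids)


# Hypothetical example data (not taken from the real output files)
example_genes = combine_gene_lists(['KRAS', 'BRAF'], ['BRAF', 'HRAS'])
example_mapping = {'KRAS': ['100', '200'], 'BRAF': ['200', '300']}
example_pmids = unique_pmids(example_mapping)
```

In a real run, the mapping would presumably be written out with `pickle` and the PMID list as one identifier per line, matching the output files listed above.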

Step 2: Build REACH without FamPlex and read PMIDs

  • Build REACH without FamPlex
  • output: combined_genes_no_famplex_stmts.pkl (the pickle file contains a dictionary mapping each PMID to a list of Statements)

Step 3: Shuffle the PMIDs and subsample the data into training and test sets

  • 80% of papers to training set, 20% of papers to test set
    • output: training_pmid_stmts.pkl
    • output: test_pmid_stmts.pkl
    • output: training_pmids.txt
    • output: test_pmids.txt
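
One way to perform an 80/20 shuffle-and-split is sketched below. It assumes the PMIDs are available as a list of strings; the function name `split_pmids`, the fixed random seed, and the rounding behavior are illustrative assumptions rather than the repository's implementation.

```python
import random


def split_pmids(pmids, train_fraction=0.8, seed=0):
    """Shuffle PMIDs and split them into training and test lists."""
    shuffled = list(pmids)
    # A seeded generator makes the (hypothetical) split reproducible
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


def subset_stmts(pmid_stmts, pmids):
    """Restrict a dict mapping PMID -> list of Statements to the given PMIDs."""
    return {pmid: pmid_stmts[pmid] for pmid in pmids if pmid in pmid_stmts}


# Hypothetical example with ten placeholder PMIDs
all_pmids = [str(i) for i in range(10)]
training_pmids, test_pmids = split_pmids(all_pmids)
```

Splitting at the level of papers, rather than individual Statements, keeps all Statements from one paper in the same set.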

Step 4: stmt_entity_stats

  • Run to create a random sample of agents from the training (or test) set for evaluation.
    • output: training_agents_sample.csv
    • output: training_agents.csv
    • output: training_ungrounded.csv
    • output: training_agent_distribution.pdf
    • output: training_ungrounded_distribution.pdf
  • Open training_agents_sample_curated.csv in Excel to curate
    • Curate agents using the following codes:
      • P: protein
      • F: family
      • C: complex
      • X: complex of families
      • S: small molecule
      • B: biological process
      • U: unknown/other
      • M: microRNA
  • For ease of curation, run to create an HTML table to check groundings in different databases.
  • Same for test set.
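
The counting and sampling described in this step might look something like the sketch below. It assumes each agent mention is represented by its text string and that mentions are sampled uniformly at random; the function names and the sampling scheme are assumptions made for illustration, not a description of the repository's script.

```python
import random
from collections import Counter


def agent_frequencies(agent_texts):
    """Return (agent text, mention count) pairs, most frequent first."""
    return Counter(agent_texts).most_common()


def sample_agents(agent_texts, sample_size, seed=0):
    """Draw a random sample of agent mentions without replacement."""
    texts = list(agent_texts)
    rng = random.Random(seed)
    # Never request more items than are available
    return rng.sample(texts, min(sample_size, len(texts)))


# Hypothetical agent mentions
mentions = ['ERK', 'ERK', 'AKT', 'Ras']
frequencies = agent_frequencies(mentions)
sample = sample_agents(mentions, 3)
```

Sampling mentions rather than unique strings means frequent agents are more likely to appear in the curation sample.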

Step 5: Curate grounding map

  • helps in finding lexical synonyms for unmapped families.
  • Fraction of most frequent agents that include a family or complex
    • Fraction of mentions
  • Fraction of most frequent ungrounded entities that include a family or complex
    • Fraction of ungrounded mentions
  • Table with 4 columns: Entity, Frequency, Category (F, C, X, or empty), Curator
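
The two fractions listed above could be computed from such a table along these lines. This is a sketch under stated assumptions: rows are given as (entity, frequency, category) tuples, the categories F, C, and X count as "family or complex", and the function name `family_complex_fractions` and the example rows are hypothetical.

```python
FAMILY_OR_COMPLEX = {'F', 'C', 'X'}


def family_complex_fractions(rows):
    """Return (fraction of entities, fraction of mentions) that are
    curated as a family, complex, or complex of families."""
    rows = list(rows)
    hits = [row for row in rows if row[2] in FAMILY_OR_COMPLEX]
    entity_fraction = len(hits) / len(rows)
    mention_fraction = sum(r[1] for r in hits) / sum(r[1] for r in rows)
    return entity_fraction, mention_fraction


# Hypothetical curated rows: (Entity, Frequency, Category)
example_rows = [('ERK', 50, 'F'), ('AP-1', 30, 'C'), ('EGFR', 20, '')]
entity_fraction, mention_fraction = family_complex_fractions(example_rows)
```

The entity fraction weights every distinct string equally, while the mention fraction weights each string by how often it occurs in the text.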

Step 6: Evaluate FamPlex resource

  • Creates a plot of frequency of groundings to each entity

