Use case test 1: demonstrating that genes named in an article's abstract are among the most significant genes linked to that article in the KG.
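The check this test is aiming for can be sketched as below. This is only an illustration: the function name `rank_abstract_genes` is hypothetical, and the case-insensitive comparison is an assumption made so that human-style (TCF4) and mouse-style (Tcf4) symbols compare equal. The example values are copied from this notebook's own outputs for PMID 32015540.

```python
def rank_abstract_genes(abstract_genes, kg_ranked_genes):
    """Return {abstract gene: 1-based rank in the KG list} for genes found in both.

    Matching is case-insensitive (an assumption, see the note above).
    """
    ranks = {}
    for rank, kg_gene in enumerate(kg_ranked_genes, start=1):
        # keep the first (best) rank if an identifier appears more than once
        ranks.setdefault(str(kg_gene).lower(), rank)
    return {g: ranks[g.lower()] for g in abstract_genes if g.lower() in ranks}


# Values copied from this notebook's outputs for PMID 32015540
abstract_genes = ["Pten", "TCF4"]
kg_top_genes = ["Gm10222", "Tcf4", "21413.0", "Hoxb8", "15416.0",
                "Pkd2l1", "329064.0", "Lpl", "16956.0", "Cd82"]

print(rank_abstract_genes(abstract_genes, kg_top_genes))  # {'TCF4': 2}
```

Numeric identifiers in the KG list are compared as plain strings here, so a gene reported only by a numeric ID would not be matched to its symbol without a separate identifier mapping.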

1. Retrieving all named genes from the 10 articles first retrieved from PubMed (from the stored metadata file)

In [1]:
#import scispacy
import csv

import spacy

# Retrieve one column of values from a CSV file (used here for the gene list)
def get_csv_column(file_path, column_name):
    data = []
    with open(file_path, 'r', encoding='utf-8', newline='') as csvfile:
        csvreader = csv.DictReader(csvfile)
        for row in csvreader:
            data.append(row[column_name])
    return data


def process_csv(file_path, search_terms, search_column, return_column):
    """Return, per CSV row, the tokens in search_column that match a search term."""
    nlp = spacy.load("en_core_web_sm")
    results = []
    # A set gives constant-time lookups; matching is case-insensitive
    search_terms_lower = {term.lower() for term in search_terms}
    with open(file_path, 'r', encoding='utf-8', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            search_text = row[search_column]
            return_value = row[return_column]
            doc = nlp(search_text)
            # Skip punctuation and verbs. Common words that coincide with gene
            # symbols (e.g. 'was', 'set') can still match and need a manual check.
            matched_terms = set(token.text for token in doc
                                if token.text.lower() in search_terms_lower
                                and not token.is_punct
                                and token.pos_ != "VERB")
            if matched_terms:
                results.append({
                    'matched_terms': list(matched_terms),
                    'return_value': return_value
                })
    return results


abstract_file = 'data/asd_article_metadata.csv'
gene_file = 'gene_list.csv'     #list of gene names taken from Ensembl 
column_name = 'Gene name'
gene_list = get_csv_column(gene_file, column_name)
search_column = 'abstract'      #will search through the columns of abstracts
return_column = 'pmid'           #will return the associated article pmid

results = process_csv(abstract_file, gene_list, search_column, return_column)

for result in results:
    print(f"Article PMID: {result['return_value']}")
    print(f"Matched gene terms: {result['matched_terms']}")
    
    print()



Article PMID: 28384108
Matched gene terms: ['CACNB1', 'CACNA1C', 'was']

Article PMID: 28485729
Matched gene terms: ['HRH4', 'HRH1', 'HRH3', 'HDC', 'set', 'was', 'HRH2', 'HNMT']

Article PMID: 30016992
Matched gene terms: ['set', 'PPP1R3F']

Article PMID: 30816183
Matched gene terms: ['Foxp2', 'Auts2', 'was']

Article PMID: 31719968
Matched gene terms: ['was']

Article PMID: 32015540
Matched gene terms: ['Pten', 'TCF7L2', 'TCF4']

Article PMID: 32365465
Matched gene terms: ['Mice', 'mice', 'was']

Article PMID: 33160303
Matched gene terms: ['was']

Article PMID: 33262327
Matched gene terms: ['NRXN2', 'ANK2', 'CHD8', 'ADNP2', 'SHANK3', 'ARID1B']

Article PMID: 34946850
Matched gene terms: ['set', 'TF']

Article PMID: 35710789
Matched gene terms: ['was']

Article PMID: 35962193
Matched gene terms: ['CIT']

Article PMID: 36213201
Matched gene terms: ['set']

Article PMID: 36688057
Matched gene terms: ['CD14']

Article PMID: 37381037
Matched gene terms: ['set', 'TEs', 'ATRX', 'impact']

[output truncated]

Visual check of the outputs (NB: of the 30 articles, only these 6 name relevant key genes in the abstract. The others often mention groups such as "circRNAs" or broad categories such as "translational machinery" instead):

Article PMID: 28384108 - Matched gene terms: ['CACNB1', 'CACNA1C']

Article PMID: 28485729 - Matched gene terms: ['HRH4', 'HRH1', 'HRH3', 'HDC', 'HRH2', 'HNMT']

Article PMID: 30016992 - Matched gene terms: ['PPP1R3F']

Article PMID: 30816183 - Matched gene terms: ['Foxp2', 'Auts2']

Article PMID: 32015540 - Matched gene terms: ['Pten', 'TCF4']
NB: the article states "mutations in the TCF4 gene, but not the TCF7L2 gene", so TCF7L2 is excluded here.

Article PMID: 33262327 - Matched gene terms: ['NRXN2', 'ANK2', 'CHD8', 'ADNP2', 'SHANK3', 'ARID1B']
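
Part of this visual check could be automated. The sketch below is an assumption-laden heuristic, not the method used above: it drops matches that are entirely lowercase (on the assumption that gene symbols in abstracts are written in upper case or capitalised) and matches in a small, hypothetical exclusion list built from the false positives seen in the output (`COMMON_WORDS`). The name `filter_gene_matches` is also hypothetical.

```python
# Hypothetical exclusion list: English words that coincide with gene symbols,
# taken from the false positives visible in the output above
COMMON_WORDS = {"was", "set", "mice", "impact"}


def filter_gene_matches(matched_terms, common_words=COMMON_WORDS):
    """Drop matches that are all-lowercase or appear in the exclusion list."""
    kept = []
    for term in matched_terms:
        if term.islower():
            continue  # assumption: gene symbols are not written all-lowercase
        if term.lower() in common_words:
            continue  # e.g. 'Mice' at the start of a sentence
        kept.append(term)
    return kept


print(filter_gene_matches(['CACNB1', 'CACNA1C', 'was']))  # ['CACNB1', 'CACNA1C']
print(filter_gene_matches(['Mice', 'mice', 'was']))       # []
```

A filter like this does not replace reading the abstract: it cannot tell that TCF7L2 is mentioned only as a negative, and ambiguous abbreviations such as 'TF' would still pass through.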

2. Querying the KG to retrieve the most significant genes within the datasets for each article: 

In [2]:
import rdflib

filename = "main_graph.nt"
g = rdflib.Graph()
g.parse(filename, format="nt")

#query to return all data-rows with a pvalue and gene
query = """
    PREFIX EDAM: <http://edamontology.org/>
    PREFIX RDF: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX DCT: <http://purl.org/dc/terms/>
    PREFIX BIOLINK: <https://w3id.org/biolink/vocab/>
    PREFIX ENSEMBL: <http://identifiers.org/ensembl/>
    PREFIX NCBIGENE: <http://identifiers.org/ncbigene/>

    SELECT ?subject ?gene ?value
    WHERE {
        ?subject EDAM:data_2082 ?value .
        ?subject BIOLINK:symbol | ENSEMBL:id | NCBIGENE:id ?gene .
    }
"""
#will add logfold info too:
#        ?subject EDAM:data_3754 ?logfold .

results = g.query(query)
#print(f"Number of results: {len(list(results))}")


In [3]:
from collections import defaultdict
from decimal import Decimal, InvalidOperation, getcontext

getcontext().prec = 6   # precision for Decimal arithmetic; does not round values parsed from strings

# PMIDs of the articles that name genes in the abstract
valid_roots = ['28384108', '28485729', '30016992', '30816183', '32015540', '33262327']

def find_root(g, node, visited=None):
    if visited is None:
        visited = set()
    visited.add(node)
    parents = list(g.subjects(predicate=None, object=node))
    if not parents:
        return node
    for parent in parents:
        if parent not in visited:
            return find_root(g, parent, visited)
    return node

# store genes and p-values for each root(article)
root_data = defaultdict(list)

for row in results:
    try:
        value = Decimal(row.value)
        if value < Decimal('0.05'):
            subject = row.subject
            root = find_root(g, subject)
            # Only add data if the root is in the valid_roots list
            if any(valid_root in str(root) for valid_root in valid_roots):
                root_data[root].append((row.gene, value))
    except (ValueError, InvalidOperation):
        pass

for root, gene_data in root_data.items():
    print(f"Article: {root}")
    for gene, pvalue in sorted(gene_data, key=lambda x: x[1])[:10]:  # Sort by p-value and return first 10
        print(f"  Gene: {gene}, P-value: {pvalue:.6f}")  # Format to 6 decimal places
    print()

# TODO: repeat with logfold figures

Article: https://pubmed.ncbi.nlm.nih.gov/32015540
  Gene: Gm10222, P-value: 0.000000
  Gene: Tcf4, P-value: 0.000000
  Gene: 21413.0, P-value: 0.000000
  Gene: Hoxb8, P-value: 0.000000
  Gene: 15416.0, P-value: 0.000000
  Gene: Pkd2l1, P-value: 0.000000
  Gene: 329064.0, P-value: 0.000000
  Gene: Lpl, P-value: 0.000000
  Gene: 16956.0, P-value: 0.000000
  Gene: Cd82, P-value: 0.000000

Article: https://pubmed.ncbi.nlm.nih.gov/28485729
  Gene: ENSG00000200959, P-value: 0.000000
  Gene: ENSG00000212443, P-value: 0.000000
  Gene: ENSG00000212232, P-value: 0.000004
  Gene: ENSG00000074935, P-value: 0.000007
  Gene: ENSG00000207008, P-value: 0.000009
  Gene: ENSG00000212402, P-value: 0.000015
  Gene: ENSG00000226637, P-value: 0.000019
  Gene: ENSG00000200406, P-value: 0.000029
  Gene: ENSG00000238864, P-value: 0.000056
  Gene: ENSG00000109016, P-value: 0.000062
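
Several p-values above print as 0.000000 because of the fixed six-decimal format, so very small values cannot be told apart from one another or from an exact zero. A minimal sketch of printing them in scientific notation follows; `format_pvalue` is a hypothetical helper and the example value is made up for illustration, not taken from the graph.

```python
from decimal import Decimal


def format_pvalue(pvalue, sig_digits=3):
    """Format a p-value in scientific notation with `sig_digits` significant digits."""
    return f"{pvalue:.{sig_digits - 1}e}"


example = Decimal("3.2e-15")   # hypothetical p-value, not taken from the graph

print(f"{example:.6f}")        # fixed-point: prints 0.000000
print(format_pvalue(example))  # scientific notation keeps the magnitude visible
```

Sorting is unaffected by the display format, since the values are compared as `Decimal` objects before they are printed.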



(This can now be repeated with more articles, including conversion between gene identifier formats and the logfold figures.)
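
As a first step towards that conversion, the mixed identifiers in the output above could be sorted by type. The sketch below rests on assumptions: that numeric values such as 21413.0 are NCBI Gene IDs that were stored as floats, and that Ensembl gene IDs follow the pattern ENS, an optional species code, G, then 11 digits. The name `classify_gene_id` is hypothetical, and the actual mapping between formats would still need a lookup table, which is not shown.

```python
import re

# Assumed patterns (see the note above)
ENSEMBL_RE = re.compile(r"^ENS[A-Z]*G\d{11}$")   # e.g. ENSG..., ENSMUSG...
NUMERIC_RE = re.compile(r"^(\d+)(?:\.0+)?$")     # e.g. 21413 or 21413.0


def classify_gene_id(value):
    """Return (identifier type, cleaned identifier) for one gene value."""
    text = str(value).strip()
    if ENSEMBL_RE.match(text):
        return ("ensembl", text)
    numeric = NUMERIC_RE.match(text)
    if numeric:
        # drop the trailing '.0' left over from float storage
        return ("ncbigene", numeric.group(1))
    return ("symbol", text)


for gene in ["Tcf4", "21413.0", "ENSG00000200959"]:
    print(classify_gene_id(gene))
```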