TODO:

- Experiment with a blank pipeline instead of en_core_web_sm, or disable the pipeline components that are not needed; see *Disabling pipeline components* in chapter 3 of the learn-spaCy project
- Use *components with extensions* from chapter 3 of the learn-spaCy project to look up model category information, e.g. training type
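
The first TODO item could be tried along these lines. This is a minimal sketch, not the notebook's actual pipeline: it assumes spaCy v3, and the two patterns are hypothetical stand-ins for the contents of `ml_entity_ruler_patterns`, which are not shown here. A blank English pipeline contains only the tokenizer, so the EntityRuler is the only component that runs.

```python
import spacy

# A blank pipeline has a tokenizer but no tagger, parser, or statistical NER.
nlp_blank = spacy.blank("en")
ruler = nlp_blank.add_pipe("entity_ruler")

# Hypothetical patterns for illustration; the real ones live in the patterns file.
ruler.add_patterns([
    {
        "label": "ML_METHOD",
        "id": "SUPPORT_VECTOR_MACHINES",
        "pattern": [
            {"LOWER": "support"},
            {"LOWER": "vector"},
            {"LOWER": {"IN": ["machine", "machines"]}},
        ],
    },
    {
        "label": "ML_METHOD",
        "id": "SUPPORT_VECTOR_MACHINES",
        "pattern": [{"LOWER": "svm"}],
    },
])

doc = nlp_blank("We trained a Support Vector Machine and compared it with an SVM baseline.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(nlp_blank.pipe_names)
print(found)
```

The other option in the TODO is to keep `en_core_web_sm` and pass a `disable` list to `spacy.load`, which loads the listed components but does not run them.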

In [1]:
import pandas as pd
import spacy
import PyPDF2
import fitz  # PyMuPDF
import re

In [2]:
ROOT_FOLDER = '../../../'

In [3]:
DATA_FOLDER = '../../../data/'

In [4]:
DOWNLOAD_FOLDER = '../../../downloads/'

In [5]:
large_ml_lncRNA_search_df = pd.read_parquet(f'{DATA_FOLDER}large_ml_lncRNA_search_df.parquet')

Filter for files that are in English

In [6]:
large_ml_lncRNA_search_df = large_ml_lncRNA_search_df[large_ml_lncRNA_search_df['language'] == 'en']

In [7]:
large_ml_lncRNA_search_df

Unnamed: 0,title,abstract,year,url,author_id,query,file_name,file_path,language
12,LncMachine: a machine learning algorithm for l...,We evaluated the performance of machine learni...,2021,https://escholarship.org/content/qt32n7m7td/qt...,"[ZeGca3cAAAAJ, o3DdNZMAAAAJ, Ydo9ResAAAAJ, 4IR...",Machine Learning lncRNA,1fcc8c69-cc2d-48d1-9052-81b0154706cd.pdf,../../../downloads/1fcc8c69-cc2d-48d1-9052-81b...,en
29,DMFLDA: a deep learning framework for predicti...,"lncRNAs and diseases, and traditional machine ...",2020,https://minzeng1990.github.io/Files/TCBB-DMFLD...,"[Q6b80i8AAAAJ, wicacBwAAAAJ, kERS9vUAAAAJ, O_x...",Machine Learning lncRNA,e1460d83-589f-45ab-93fc-8b36160efa8d.pdf,../../../downloads/e1460d83-589f-45ab-93fc-8b3...,en
48,Evaluation of deep learning in non-coding RNA ...,"In this study, we review the progress of ncRNA...",2019,https://u.osu.edu/bmbl/files/2021/01/lncfinder...,"[xAqx-WkAAAAJ, 2tbe1RoAAAAJ, Vt5edEkAAAAJ]",Machine Learning lncRNA,59adb036-a0ef-4839-ab89-4f3990b7e26e.pdf,../../../downloads/59adb036-a0ef-4839-ab89-4f3...,en
58,A review of machine learning-based prediction ...,the subcellular localization of lncRNAs on a l...,2023,http://www.clausiuspress.com/assets/default/ar...,"[, ]",Machine Learning lncRNA,b0d97201-c1f7-435d-9f82-0a36c76a004f.pdf,../../../downloads/b0d97201-c1f7-435d-9f82-0a3...,en
89,LncRNA Subcellular Localization Signals–Are th...,at the 5’ end or 3’ end of lncRNA [2]. In this...,2023,https://par.nsf.gov/servlets/purl/10538687,"[Lcmc_iUAAAAJ, L8nlPxYAAAAJ, fEQEjCIAAAAJ, Wfz...",Machine Learning lncRNA,7889091f-b484-4e8d-bab2-0b150ba85ce3.pdf,../../../downloads/7889091f-b484-4e8d-bab2-0b1...,en
...,...,...,...,...,...,...,...,...,...
913,Mitochondrial Import of Malat1 Regulates Cardi...,machine learning in the classi cation of lncRN...,2020,https://scholar.archive.org/work/urt26oavknafp...,"[hhOh95IAAAAJ, SWozBjAAAAAJ, , X_RVrq4AAAAJ]",Machine Learning lncRNA,a8d9deec-c217-4bf1-86a8-c4bb8c10a6a2.pdf,../../../downloads/a8d9deec-c217-4bf1-86a8-c4b...,en
917,Impact of sequencing technologies on long non-...,"In summary, the analyzed tools use supervised ...",2022,https://www.biorxiv.org/content/biorxiv/early/...,"[, , ]",Machine Learning lncRNA,3ce36b96-ff84-4b84-9c17-f6fbdbbf1cef.pdf,../../../downloads/3ce36b96-ff84-4b84-9c17-f6f...,en
921,Deciphering the methylation landscape in breas...,Innovative automated machine learning was empl...,2021,http://repository-empedu-rd.ekt.gr/empedu-rd/b...,"[, hojAa00AAAAJ]",Machine Learning lncRNA,48c52af8-3a4f-40b9-91a8-dd04c8a41f1d.pdf,../../../downloads/48c52af8-3a4f-40b9-91a8-dd0...,en
928,Machine learning models for predicting lymph n...,These analyses should be extended to the effec...,2020,https://scholar.archive.org/work/bbbmcpqig5eyp...,"[, , , , , ]",Machine Learning lncRNA,b55db42f-8ffb-452b-81be-a8e4270ef0a3.pdf,../../../downloads/b55db42f-8ffb-452b-81be-a8e...,en


In [8]:
paper_titles = large_ml_lncRNA_search_df['title'].values.tolist()
paper_pdf_file_paths = large_ml_lncRNA_search_df['file_path'].values.tolist()

In [9]:
# Load the small English pipeline and add an EntityRuler ahead of the statistical NER,
# so rule-based ML_METHOD matches take precedence over the model's predictions.
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
# Load the EntityRuler patterns from disk
ruler.from_disk(f"{DATA_FOLDER}ml_entity_ruler_patterns")

<spacy.pipeline.entityruler.EntityRuler at 0x7d139e064650>

In [10]:
# Test the pipeline
text = "Support Vector Machine, Support Vector Machines, SVM, and S.V.M. are popular machine learning methods."
doc = nlp(text)

# Print the first ML_METHOD entity (the loop stops at the first match)
for ent in doc.ents:
    if ent.label_ == "ML_METHOD":
        print(ent.text, ent.label_, ent.id_)
        break

Support Vector Machine ML_METHOD SUPPORT_VECTOR_MACHINES


In [11]:
ent.text

'Support Vector Machine'

In [12]:
def extract_sections_from_pdf(pdf_file_path):
    """
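    Extracts the text of the Introduction, Methods, Results, Discussion, and Conclusion
    sections from a PDF, skipping text under References and Appendix headings.

    A heading is recognized only when it stands alone on a line, so a numbered heading
    such as "1. Introduction" is not matched. If a heading occurs more than once, the
    later text replaces the earlier text.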

    Args:
        pdf_file_path (str): Path to the PDF file.

    Returns:
        dict: A dictionary with section titles as keys and their content as values.
    """
    # Open the PDF file
    doc = fitz.open(pdf_file_path)
    
    # Initialize variables
    sections = {}
    current_section = None
    current_text = []

    # Regular expressions to identify sections
    section_pattern = re.compile(r"^(Introduction|Discussion|Conclusion|Methods|Results)$", re.IGNORECASE)
    #section_pattern = re.compile(r"^(Introduction|Discussion|Conclusion|Methods|Results|Appendix|Appendices)$", re.IGNORECASE)
    ignore_pattern = re.compile(r"^(References|Appendix|Appendices)$", re.IGNORECASE)
    #ignore_pattern = re.compile(r"^(References)$", re.IGNORECASE)

    for page in doc:
        # Extract text from the page
        text = page.get_text("text")

        # Split text into lines for processing
        lines = text.splitlines()

        for line in lines:
            line = line.strip()
            if not line:
                continue

            # Check if the line matches a section title
            section_match = section_pattern.match(line)
            ignore_match = ignore_pattern.match(line)

            if section_match:
                # Save the current section's text
                if current_section and current_text:
                    sections[current_section] = "\n".join(current_text)

                # Start a new section
                current_section = section_match.group(0).capitalize()
                current_text = []

            elif ignore_match:
                # Save the current section, then skip lines until the next recognized heading
                if current_section and current_text:
                    sections[current_section] = "\n".join(current_text)
                current_section = None
                current_text = []

            elif current_section:
                # Append text to the current section
                current_text.append(line)

    # Save the last section's text
    if current_section and current_text:
        sections[current_section] = "\n".join(current_text)

    doc.close()
    return sections
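
The heading logic in `extract_sections_from_pdf` can be tried without a PDF. The sketch below applies the same two regular expressions to a list of already-extracted lines. `split_sections` and the sample lines are hypothetical and used only for illustration. Unlike the function above, this version accumulates text when a heading repeats instead of overwriting it.

```python
import re

SECTION_PATTERN = re.compile(r"^(Introduction|Discussion|Conclusion|Methods|Results)$", re.IGNORECASE)
IGNORE_PATTERN = re.compile(r"^(References|Appendix|Appendices)$", re.IGNORECASE)

def split_sections(lines):
    """Group lines under the most recent recognized heading (hypothetical helper)."""
    sections = {}
    current_section = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if SECTION_PATTERN.match(line):
            # Start (or resume) collecting text for this heading
            current_section = line.capitalize()
            sections.setdefault(current_section, [])
        elif IGNORE_PATTERN.match(line):
            # Skip everything until the next recognized heading
            current_section = None
        elif current_section:
            sections[current_section].append(line)
    # Drop headings that collected no text
    return {title: "\n".join(body) for title, body in sections.items() if body}

sample_lines = [
    "Some title text",
    "INTRODUCTION",
    "lncRNAs are long transcripts.",
    "1. Methods",
    "We used random forests.",
    "References",
    "[1] A cited paper.",
]
sections = split_sections(sample_lines)
print(sections)
```

In this sample the numbered heading `1. Methods` is not recognized, so it and the line after it end up inside the Introduction text. Papers with numbered headings would likely need a looser pattern.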

In [15]:
def search_in_file(pdf_file_path):
    """
    Counts ML_METHOD entity mentions in a PDF, keyed by entity ID.

    Args:
        pdf_file_path (str): Path to the PDF file.

    Returns:
        dict: Maps each entity ID (e.g. 'SUPPORT_VECTOR_MACHINES') to its mention count.
    """
    search_results = dict()
    # Extract text from the PDF one page at a time
    with open(pdf_file_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for page in pdf_reader.pages:
            text = page.extract_text()
            if not text:
                continue  # Skip pages with no extractable text

            # Run the spaCy pipeline (including the EntityRuler) on the page text
            doc = nlp(text)

            # Count each ML_METHOD entity under its pattern ID
            for ent in doc.ents:
                if ent.label_ == "ML_METHOD":
                    search_results[ent.id_] = search_results.get(ent.id_, 0) + 1

    return search_results
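
The counting step in `search_in_file` is a plain frequency count, which `collections.Counter` expresses directly. The sketch below uses hypothetical `(label, id)` tuples in place of spaCy entities so that it runs without a model or a PDF.

```python
from collections import Counter

# Hypothetical (label, entity ID) pairs standing in for doc.ents on two pages.
page_entities = [
    [("ML_METHOD", "RANDOM_FORESTS"), ("ORG", ""), ("ML_METHOD", "SUPPORT_VECTOR_MACHINES")],
    [("ML_METHOD", "RANDOM_FORESTS")],
]

method_counts = Counter()
for entities in page_entities:
    # update() adds one count per ID yielded by the generator
    method_counts.update(ent_id for label, ent_id in entities if label == "ML_METHOD")

print(dict(method_counts))
```

A `Counter` converts to a plain dictionary with `dict(...)`, so code that consumes the per-paper results would not need to change.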

Search all papers and count the ML_METHOD mentions in each

In [30]:
paper_terms = list()

for paper_title, pdf_file_path in zip(paper_titles, paper_pdf_file_paths):
    print(f'processing: {paper_title}')
    result = search_in_file(pdf_file_path=pdf_file_path)
    paper_terms.append({'paper_title': paper_title, 'result': result})    

processing: LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants
processing: DMFLDA: a deep learning framework for predicting lncRNA–disease associations
processing: Evaluation of deep learning in non-coding RNA classification
processing: A review of machine learning-based prediction of lncRNA subcellular localization
processing: LncRNA Subcellular Localization Signals–Are the Two Ends Equal? A Machine Learning Analysis Across Multiple Cell Lines
processing: Identification of Characteristic lncRNA Molecular Markers in Osteoarthritis by Integrating GEO Database and Machine Learning Strategies and Experimental Validation
processing: Towards machine learning in molecular biology
processing: SDLDA: lncRNA-disease association prediction based on singular value decomposition and deep learning
processing: A systematic evaluation of the computational tools for lncRNA identification
processing: LncDLSM: identification of long non-coding RNAs with deep learning-ba

incorrect startxref pointer(2)


processing: An Optimized Technique for RNA Prediction Based on Neural Network
processing: A workflow combining single-cell CRISPRi screening and a supervised autoencoder neural network to detect subtle transcriptomic perturbations induced by lncRNA …
processing: LncRNA FAM83H-AS1 Amplification is Associated With a Poor Prognosis in Lung Adenocarcinoma and Can Serve as A Therapeutic Target
processing: Machine learning-informed liquid-liquid phase separation for personalized breast cancer treatment assessment
processing: HRGCNLDA: Forecasting of lncRNA-disease association based on hierarchical refinement graph convolutional neural network
processing: Comparison of performance of different k values with k-fold cross validation in a graph-based learning model for incrna-disease prediction
processing: GATLncLoc+ C&S: Prediction of LncRNA subcellular localization based on corrective graph attention network
processing: Identification of Candidate Diagnostic gene CDC25C for Trametinib-induced 



processing: … and Exploration of Immune Activation Pathways in T-cell Mediated Rejection through Integrated Bulk and Single-Cell RNA-Seq Analysis with Machine Learning
processing: Machine Learning Models for Human Synapse Genomics
processing: PREDICTING THE PRIMARY TISSUES OF CANCERS OF UNKNOWN PRIMARY USING MACHINE LEARNING
processing: Application of deep learning in genomics
processing: Machine learning-based diagnostic model of lymphatics-associated genes for new therapeutic target analysis in intervertebral disc degeneration
processing: Common Features in lncRNA Annotation and Classification: A Survey. Noncoding RNA. 2021; 7: 77
processing: SPIREX: Improving LLM-based relation extraction from RNA-focused scientific literature using graph machine learning
processing: On the prediction of mRNA subcellular localization with machine learning
processing: Machine learning methods for effectively discovering complex relationships in graph data
processing: Mitochondrial Import of Malat1 Re
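
The run above prints PyPDF2 warnings such as `incorrect startxref pointer(2)` for some files. If a damaged PDF raised an exception instead, the loop as written would stop at that file. A minimal sketch of a more defensive loop follows. `process_papers` and `fake_search` are hypothetical names, with `fake_search` standing in for `search_in_file`, and recording failures in a separate list is an assumption about how one might want to handle them.

```python
def process_papers(titles, paths, search_fn):
    """Run search_fn on each path; collect results and failures separately."""
    results, failures = [], []
    for title, path in zip(titles, paths):
        try:
            results.append({"paper_title": title, "result": search_fn(path)})
        except Exception as exc:  # broad on purpose: one bad file should not stop the run
            failures.append({"paper_title": title, "error": repr(exc)})
    return results, failures

def fake_search(path):
    """Hypothetical stand-in for search_in_file: fails for one path."""
    if path == "broken.pdf":
        raise ValueError("cannot parse PDF")
    return {"RANDOM_FORESTS": 1}

results, failures = process_papers(["A", "B"], ["ok.pdf", "broken.pdf"], fake_search)
print(results)
print(failures)
```

Keeping the failures makes it possible to report which papers were skipped rather than silently dropping them.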

Flatten the search results into one row per (paper, term) pair and store them as a DataFrame

In [33]:
ml_lncRNA_search_result_df = pd.DataFrame(
    [
        (x['paper_title'], term, cnt)
        for x in paper_terms
        for term, cnt in x['result'].items()
    ],
    columns=['title', 'ml_term', 'cnt'],
)

In [34]:
ml_lncRNA_search_result_df

Unnamed: 0,title,ml_term,cnt
0,LncMachine: a machine learning algorithm for l...,RANDOM_FORESTS,9
1,LncMachine: a machine learning algorithm for l...,SUPPORT_VECTOR_MACHINES,11
2,LncMachine: a machine learning algorithm for l...,LOGISTIC_REGRESSION,2
3,LncMachine: a machine learning algorithm for l...,DECISION_TREES,1
4,LncMachine: a machine learning algorithm for l...,BOOSTING,1
...,...,...,...
496,Development of New Bioinformatic Approaches fo...,RANDOM_FORESTS,16
497,Development of New Bioinformatic Approaches fo...,SUPPORT_VECTOR_MACHINES,62
498,Development of New Bioinformatic Approaches fo...,PRINCIPAL_COMPONENT_ANALYSIS,1
499,Development of New Bioinformatic Approaches fo...,DECISION_TREES,1
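
The flattening step turns one dictionary per paper into one row per (paper, term) pair. The sketch below runs the same comprehension on hypothetical data, without pandas, and then totals the counts per term as one possible follow-up. The totals step is not part of the notebook.

```python
from collections import defaultdict

# Hypothetical search results in the same shape as paper_terms.
example_paper_terms = [
    {"paper_title": "Paper A", "result": {"RANDOM_FORESTS": 9, "BOOSTING": 1}},
    {"paper_title": "Paper B", "result": {}},
    {"paper_title": "Paper C", "result": {"RANDOM_FORESTS": 16}},
]

# One (title, term, count) row per term found in each paper
rows = [
    (entry["paper_title"], term, cnt)
    for entry in example_paper_terms
    for term, cnt in entry["result"].items()
]

# Sum the counts for each term across all papers
totals = defaultdict(int)
for _title, term, cnt in rows:
    totals[term] += cnt

print(rows)
print(dict(totals))
```

A paper with an empty result, such as Paper B here, contributes no rows, so it does not appear in the flattened table.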


In [35]:
ml_lncRNA_search_result_df.to_parquet(f'{DATA_FOLDER}ml_lncRNA_search_result_df.parquet')