TODO:

- Experiment with a blank pipeline instead of en_core_web_sm, or disable the pipeline components that are not needed; see *Disabling pipeline components* in chapter 3 of the learn-spaCy project
- Use *components with extensions* from chapter 3 of the learn-spaCy project to look up category information for each matched model, e.g. its training type
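
A minimal sketch of both TODO items, assuming spaCy 3.x: a blank English pipeline that holds only an entity ruler, plus a custom `Span` extension that looks up a category for each match. The two patterns, the `ML_CATEGORIES` table, and the `ml_category` extension name are hypothetical stand-ins, not the contents of `ml_entity_ruler_patterns`.

```python
import spacy
from spacy.tokens import Span

# Hypothetical lookup table: lower-cased method name -> training type
ML_CATEGORIES = {
    "svm": "supervised",
    "random forest": "supervised",
}


def get_ml_category(span):
    """Getter for the custom extension: look up the span text in the table."""
    return ML_CATEGORIES.get(span.text.lower(), "unknown")


# force=True lets the cell be re-run without an "already registered" error
Span.set_extension("ml_category", getter=get_ml_category, force=True)

# Blank pipeline: tokenizer only, no tagger, parser, or statistical NER
blank_nlp = spacy.blank("en")
blank_ruler = blank_nlp.add_pipe("entity_ruler")
blank_ruler.add_patterns([
    {"label": "ML_METHOD", "pattern": "SVM"},
    {"label": "ML_METHOD", "pattern": [{"LOWER": "random"}, {"LOWER": "forest"}]},
])

blank_doc = blank_nlp("SVM and Random Forest are popular methods.")
found = [(ent.text, ent.label_, ent._.ml_category) for ent in blank_doc.ents]
print(found)
```

To keep en_core_web_sm but skip unused components, `spacy.load("en_core_web_sm", disable=["parser", "lemmatizer"])` is the other route; which components are safe to drop depends on what the loaded patterns rely on.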

In [1]:
import pandas as pd
import spacy
import PyPDF2
import fitz  # PyMuPDF
import re

In [2]:
ROOT_FOLDER = '../../../'

In [3]:
DATA_FOLDER = '../../../data/'

In [4]:
DOWNLOAD_FOLDER = '../../../downloads/'

In [5]:
large_ml_lncRNA_search_df = pd.read_parquet(f'{DATA_FOLDER}large_ml_lncRNA_search_df.parquet')

Filter for files that are in English

In [6]:
large_ml_lncRNA_search_df = large_ml_lncRNA_search_df[large_ml_lncRNA_search_df['language'] == 'en']

In [7]:
large_ml_lncRNA_search_df

Unnamed: 0,title,abstract,year,url,author_id,query,file_name,file_path,language
12,LncMachine: a machine learning algorithm for l...,We evaluated the performance of machine learni...,2021,https://escholarship.org/content/qt32n7m7td/qt...,"[ZeGca3cAAAAJ, o3DdNZMAAAAJ, Ydo9ResAAAAJ, 4IR...",Machine Learning lncRNA,1fcc8c69-cc2d-48d1-9052-81b0154706cd.pdf,../../../downloads/1fcc8c69-cc2d-48d1-9052-81b...,en
29,DMFLDA: a deep learning framework for predicti...,"lncRNAs and diseases, and traditional machine ...",2020,https://minzeng1990.github.io/Files/TCBB-DMFLD...,"[Q6b80i8AAAAJ, wicacBwAAAAJ, kERS9vUAAAAJ, O_x...",Machine Learning lncRNA,e1460d83-589f-45ab-93fc-8b36160efa8d.pdf,../../../downloads/e1460d83-589f-45ab-93fc-8b3...,en
48,Evaluation of deep learning in non-coding RNA ...,"In this study, we review the progress of ncRNA...",2019,https://u.osu.edu/bmbl/files/2021/01/lncfinder...,"[xAqx-WkAAAAJ, 2tbe1RoAAAAJ, Vt5edEkAAAAJ]",Machine Learning lncRNA,59adb036-a0ef-4839-ab89-4f3990b7e26e.pdf,../../../downloads/59adb036-a0ef-4839-ab89-4f3...,en
58,A review of machine learning-based prediction ...,the subcellular localization of lncRNAs on a l...,2023,http://www.clausiuspress.com/assets/default/ar...,"[, ]",Machine Learning lncRNA,b0d97201-c1f7-435d-9f82-0a36c76a004f.pdf,../../../downloads/b0d97201-c1f7-435d-9f82-0a3...,en
89,LncRNA Subcellular Localization Signals–Are th...,at the 5’ end or 3’ end of lncRNA [2]. In this...,2023,https://par.nsf.gov/servlets/purl/10538687,"[Lcmc_iUAAAAJ, L8nlPxYAAAAJ, fEQEjCIAAAAJ, Wfz...",Machine Learning lncRNA,7889091f-b484-4e8d-bab2-0b150ba85ce3.pdf,../../../downloads/7889091f-b484-4e8d-bab2-0b1...,en
...,...,...,...,...,...,...,...,...,...
913,Mitochondrial Import of Malat1 Regulates Cardi...,machine learning in the classi cation of lncRN...,2020,https://scholar.archive.org/work/urt26oavknafp...,"[hhOh95IAAAAJ, SWozBjAAAAAJ, , X_RVrq4AAAAJ]",Machine Learning lncRNA,a8d9deec-c217-4bf1-86a8-c4bb8c10a6a2.pdf,../../../downloads/a8d9deec-c217-4bf1-86a8-c4b...,en
917,Impact of sequencing technologies on long non-...,"In summary, the analyzed tools use supervised ...",2022,https://www.biorxiv.org/content/biorxiv/early/...,"[, , ]",Machine Learning lncRNA,3ce36b96-ff84-4b84-9c17-f6fbdbbf1cef.pdf,../../../downloads/3ce36b96-ff84-4b84-9c17-f6f...,en
921,Deciphering the methylation landscape in breas...,Innovative automated machine learning was empl...,2021,http://repository-empedu-rd.ekt.gr/empedu-rd/b...,"[, hojAa00AAAAJ]",Machine Learning lncRNA,48c52af8-3a4f-40b9-91a8-dd04c8a41f1d.pdf,../../../downloads/48c52af8-3a4f-40b9-91a8-dd0...,en
928,Machine learning models for predicting lymph n...,These analyses should be extended to the effec...,2020,https://scholar.archive.org/work/bbbmcpqig5eyp...,"[, , , , , ]",Machine Learning lncRNA,b55db42f-8ffb-452b-81be-a8e4270ef0a3.pdf,../../../downloads/b55db42f-8ffb-452b-81be-a8e...,en


In [8]:
paper_titles = large_ml_lncRNA_search_df['title'].values.tolist()
paper_pdf_file_paths = large_ml_lncRNA_search_df['file_path'].values.tolist()

In [9]:
# Load the small English model and add an EntityRuler ahead of the statistical NER
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
# Load the ML-method patterns from disk
ruler.from_disk(f"{DATA_FOLDER}ml_entity_ruler_patterns")

<spacy.pipeline.entityruler.EntityRuler at 0x7a428cd5d3d0>

In [10]:
# Test the pipeline
text = "Support Vector Machine, Support Vector Machines, SVM, and S.V.M. are popular machine learning methods."
doc = nlp(text)

# Print the first detected ML_METHOD entity
for ent in doc.ents:
    if ent.label_ == "ML_METHOD":
        print(ent.text, ent.label_)
        break

Support Vector Machine ML_METHOD


In [11]:
ent.text

'Support Vector Machine'

In [12]:
def extract_sections_from_pdf(pdf_file_path):
    """
    Extracts sections from a PDF and identifies specific sections like Introduction and Discussion,
    while ignoring References and Appendices.

    Args:
        pdf_file_path (str): Path to the PDF file.

    Returns:
        dict: A dictionary with section titles as keys and their content as values.
    """
    # Open the PDF file
    doc = fitz.open(pdf_file_path)
    
    # Initialize variables
    sections = {}
    current_section = None
    current_text = []

    # Regular expressions to identify sections
    section_pattern = re.compile(r"^(Introduction|Discussion|Conclusion|Methods|Results)$", re.IGNORECASE)
    #section_pattern = re.compile(r"^(Introduction|Discussion|Conclusion|Methods|Results|Appendix|Appendices)$", re.IGNORECASE)
    ignore_pattern = re.compile(r"^(References|Appendix|Appendices)$", re.IGNORECASE)
    #ignore_pattern = re.compile(r"^(References)$", re.IGNORECASE)

    for page in doc:
        # Extract text from the page
        text = page.get_text("text")

        # Split text into lines for processing
        lines = text.splitlines()

        for line in lines:
            line = line.strip()
            if not line:
                continue

            # Check if the line matches a section title
            section_match = section_pattern.match(line)
            ignore_match = ignore_pattern.match(line)

            if section_match:
                # Save the current section's text
                if current_section and current_text:
                    sections[current_section] = "\n".join(current_text)

                # Start a new section
                current_section = section_match.group(0).capitalize()
                current_text = []

            elif ignore_match:
                # Save the current section and skip lines until the next recognized heading
                if current_section and current_text:
                    sections[current_section] = "\n".join(current_text)
                current_section = None
                current_text = []

            elif current_section:
                # Append text to the current section
                current_text.append(line)

    # Save the last section's text
    if current_section and current_text:
        sections[current_section] = "\n".join(current_text)

    doc.close()
    return sections
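
The heading logic can be checked without a PDF by running it over a list of lines. The following is a minimal sketch, not the function above: `split_sections` and `sample_lines` are hypothetical names, and unlike `extract_sections_from_pdf` it appends to a section whose heading repeats instead of overwriting it.

```python
import re

SECTION_RE = re.compile(
    r"^(Introduction|Discussion|Conclusion|Methods|Results)$", re.IGNORECASE
)
IGNORE_RE = re.compile(r"^(References|Appendix|Appendices)$", re.IGNORECASE)


def split_sections(lines):
    """Group lines under the most recent recognized section heading."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if SECTION_RE.match(line):
            # A heading must be the whole line, e.g. "Results" or "RESULTS"
            current = line.capitalize()
            sections.setdefault(current, [])
        elif IGNORE_RE.match(line):
            # Skip everything until the next recognized heading
            current = None
        elif current is not None:
            sections[current].append(line)
    # Drop headings that collected no text
    return {name: "\n".join(body) for name, body in sections.items() if body}


sample_lines = [
    "Some title",
    "INTRODUCTION",
    "lncRNAs are long transcripts.",
    "",
    "Results",
    "Random Forest performed best.",
    "References",
    "Smith et al. 2020",
]
sections = split_sections(sample_lines)
print(sections)
```

Because both patterns are anchored with `^` and `$`, headings such as "1. Introduction" or "Materials and Methods" are not recognized; papers that number their headings would need a looser pattern.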

In [13]:
def search_in_file(pdf_file_path):
    """
    Finds ML_METHOD entities in a PDF and prints the pages on which each one occurs.

    Args:
        pdf_file_path (str): Path to the PDF file.

    Returns:
        dict: Maps each matched entity text to a list of 1-indexed page numbers.
    """
    search_results = dict()
    # Extract text from the PDF page by page
    with open(pdf_file_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for page_num, page in enumerate(pdf_reader.pages, start=1):
            text = page.extract_text()
            if not text:
                continue

            # Use spaCy to process the text and collect ML_METHOD entities
            doc = nlp(text)
            for ent in doc.ents:
                if ent.label_ == "ML_METHOD":
                    search_results.setdefault(ent.text, []).append(page_num)

    # Display search results, one line per distinct entity text
    for term, pages in search_results.items():
        print(f"'{term}' found on page(s): {set(pages)}")

    return search_results

In [14]:
for paper_title, pdf_file_path in zip(paper_titles, paper_pdf_file_paths):
    print('-'*80)
    print(f'processing: {paper_title}')
    print('-'*80)
    search_in_file(pdf_file_path=pdf_file_path)
    print('-'*80)
    break

--------------------------------------------------------------------------------
processing: LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants
--------------------------------------------------------------------------------
'Random Forest' found on page(s): {1, 3, 4, 9}
'Support Vector Machine' found on page(s): {2, 4, 5}
'SVM' found on page(s): {2, 4, 5}
'Logistic Regression' found on page(s): {2, 6}
'decision tree' found on page(s): {2}
'Random Forests' found on page(s): {2}
'boosting' found on page(s): {2}
'AdaBoost' found on page(s): {3, 4}
'Neural Networks' found on page(s): {9, 7}
'random forests' found on page(s): {9}
'support vector machine' found on page(s): {10}
'GMM' found on page(s): {10}
--------------------------------------------------------------------------------
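
The output above lists 'Random Forest', 'Random Forests', and 'random forests' as separate terms, and likewise 'SVM' and 'Support Vector Machine'. One possible way to merge such variants is to lower-case each term and map it through an alias table. This is a minimal sketch: `ALIASES` and `merge_variants` are hypothetical, and the page numbers are copied from the printed sets above.

```python
from collections import defaultdict

# Hypothetical alias table: lower-cased variant -> canonical name
ALIASES = {
    "svm": "support vector machine",
    "random forests": "random forest",
}


def merge_variants(search_results):
    """Merge page lists of terms that differ only by case or a known alias."""
    merged = defaultdict(set)
    for term, pages in search_results.items():
        key = term.lower()
        key = ALIASES.get(key, key)
        merged[key].update(pages)
    return {term: sorted(pages) for term, pages in merged.items()}


# Subset of the results printed above for the first paper
raw_results = {
    "Random Forest": [1, 3, 4, 9],
    "Support Vector Machine": [2, 4, 5],
    "SVM": [2, 4, 5],
    "Random Forests": [2],
    "random forests": [9],
    "support vector machine": [10],
}
merged_results = merge_variants(raw_results)
print(merged_results)
```

An explicit alias table is more predictable than stripping a trailing "s", which would also mangle names such as "k-means".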


In [25]:
for paper_title, pdf_file_path in zip(paper_titles, paper_pdf_file_paths):
    print('-'*80)
    print(f'processing: {paper_title}')
    print('-'*80)
    result = extract_sections_from_pdf(pdf_file_path=pdf_file_path)
    print('-'*80)
    print(result)
    # Stop after the first paper while testing
    break

--------------------------------------------------------------------------------
processing: LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'Introduction': 'With current advances in high-throughput sequencing tech-\nnologies, a vast number of transcripts have been experimen-\ntally determined for a plethora of different species, including a\nnumber of plants, animals, insects, and microbes (Szymański\nand Barciszewski 2002; Claverie 2005; Mercer et al. 2011;\nCagirici et al. 2017; IWGSC 2018). Transcriptomic and geno-\nmic studies have revealed that although the lengths of many of\nthese transcripts are greater than 200 nucleotides, the majority\ndo not code for functional proteins (Pennisi 2012; Budak et al.\n2020). Such transcripts have been defined as long noncoding\nRNAs (lncRNAs). In
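
The loop above stops after the first paper. If the aim is to process the first few papers instead, `itertools.islice` caps the iteration without a manual counter. A minimal sketch, in which `first_n_pairs` and the demo lists are hypothetical stand-ins for `paper_titles` and `paper_pdf_file_paths`:

```python
from itertools import islice


def first_n_pairs(titles, file_paths, n):
    """Return at most n (title, file_path) pairs, in their original order."""
    return list(islice(zip(titles, file_paths), n))


# Hypothetical stand-ins for paper_titles and paper_pdf_file_paths
demo_titles = ["paper A", "paper B", "paper C"]
demo_paths = ["a.pdf", "b.pdf", "c.pdf"]

selected = first_n_pairs(demo_titles, demo_paths, 2)
print(selected)
```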

In [27]:
list(result.keys())

['Introduction', 'Results', 'Discussion']

In [16]:
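search_results = dict()
for s in result.keys():
    text = result[s]
    doc = nlp(text)

    # Detect entities
    for ent in doc.ents:
        if ent.label_ == "ML_METHOD":
            # Debugging stop: exits the inner loop at the first ML_METHOD entity so that
            # `ent` can be inspected in the following cells. The counting code below is
            # unreachable, which is why search_results stays empty.
            break
            if ent.text in search_results:
                search_results[ent.text] += 1
            else:
                search_results[ent.text] = 1  # Store the count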

In [24]:
search_results

{}
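
The dictionary is empty because the `break` in the loop runs before the counting code is reached. Without that `break`, the counting could be written with `collections.Counter`. This is a minimal sketch: `Ent` is a hypothetical stand-in for a spaCy entity span (anything with `text` and `label_` attributes), and the sample entities are made up for illustration.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy Span: only .text and .label_ are used
Ent = namedtuple("Ent", ["text", "label_"])


def count_ml_methods(ents):
    """Count how often each ML_METHOD entity text occurs."""
    return Counter(ent.text for ent in ents if ent.label_ == "ML_METHOD")


sample_ents = [
    Ent("AdaBoost", "ML_METHOD"),
    Ent("2020", "DATE"),
    Ent("AdaBoost", "ML_METHOD"),
    Ent("SVM", "ML_METHOD"),
]
counts = count_ml_methods(sample_ents)
print(counts.most_common())
```

With the real pipeline, the per-section counts could be accumulated with `counts.update(count_ml_methods(nlp(text).ents))` inside the loop over `result`.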

In [18]:
ent.label_

'ML_METHOD'

In [19]:
ent.text

'AdaBoost'

In [20]:
ent

AdaBoost

In [23]:
ent.id_

''