#### **Objective of the Notebook:**
***
This notebook preprocesses and analyzes XML documents from the MICCAI 2023 conference with the aim to:
- **Extract Structured Information:** Transform raw XML data into a clean, organized format for detailed analysis.
- **Identify Key Content:** Extract headers, titles, and related text from XML documents.
- **Focus on Specific Research:** Identify and extract papers related to cancer topics.
- **Data Aggregation:** Aggregate data into structured dataframes to facilitate further analysis and insights extraction.

#### Input Data Expected:
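- **PDF Documents:** MICCAI 2023 papers as PDF files in a predefined directory on the user's system (`../miccai_articles` in this notebook), which GROBID converts to XML.
- **XML Documents:** The GROBID-generated XML files, written to `03MICCAI_notebook_GROBID_processed_volumes` and accessible to the script.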

#### Output Data/Files Generated:
- [Access all outputs](00MICCAI_total_outputs/03MICCAI_all_outputs)
- **Structured Data CSV:** `03MICCAI_notebook_df_paper_extractions_all_cleaned_.csv` containing parsed and cleaned information.
- **Cancer-Related CSV:** `03MICCAI_notebook_df_paper_extractions_cancer.csv` aggregating information from cancer-related papers.
- **Refined Categorization CSVs:** Files like `03MICCAI_notebook_df_paper_extractions_patients_and_cancer.csv` based on refined categorizations, such as mentions of patients in cancer-related papers.

#### Assumptions or Important Notes:
- **GROBID Requirement:** A GROBID server must be running on the local machine (the notebook connects to `http://localhost:8070`) to convert the PDF documents to XML.
- **Document Structure:** Assumes XML documents adhere to a consistent structure suitable for automated parsing.
- **File Renaming:** Files are renamed after the titles extracted from the XML, which assumes each document can be uniquely identified by its title.
- **Expected Headers and Formats:** Specific headers and textual formats are anticipated within the XML files for accurate parsing.
- **Error Handling:** Error handling is minimal, so the XML files should not contain corrupt data or unexpected formatting.


***

#### **Input and Output Data**

| Type         | Description | File/Folder Name                                          |
|--------------|-------------|-----------------------------------------------------------|
| **Input**    | PDF Documents from MICCAI 2023 stored in a predefined directory on the user's system. | Predefined directory containing PDF files. |
| **Input**    | XML Documents from MICCAI 2023 stored in a predefined directory on the user's system. | Predefined directory containing XML files. |
| **Output**   | Structured Dataframe containing parsed and cleaned information. | `03MICCAI_notebook_df_paper_extractions_all_cleaned_.csv` |
| **Output**   | CSV aggregating information from cancer-related papers. | `03MICCAI_notebook_df_paper_extractions_cancer.csv` |
| **Output**   | CSV featuring refined categorizations such as patient mentions in cancer-related papers. | `03MICCAI_notebook_df_paper_extractions_patients_and_cancer.csv` |
| **Output**   | CSV of 100 randomly selected papers for annotation and further analysis. | `03MICCAI_notebook_100_randomly_selected_papers.csv` |







## **Automated Extraction and Analysis of MICCAI 2023 XML Documents**
***

Libraries and installations

In [1]:
#!pip install lxml

# Run GROBID in terminal before running the notebook
# Installation and running commands
# wget https://github.com/kermitt2/grobid/archive/0.8.0.zip
# unzip 0.8.0.zip
# cd grobid-0.8.0
# ./gradlew run

import os
import re
import pandas as pd
import numpy as np
from xml.etree import ElementTree as et
from lxml import etree 
from collections import Counter

In [2]:
def wrap_text(text, width=100):
    """
    A simple function to wrap text at a given width.
    """
    if pd.isnull(text):
        return text  # Handle NaN values
    
    wrapped_lines = []
    for paragraph in text.split('\n'):  # Splitting by existing newlines to preserve paragraph breaks
        line = ''
        for word in paragraph.split():
            if len(line) + len(word) + 1 > width:
                wrapped_lines.append(line)
                line = word
            else:
                line += (' ' + word if line else word)
        wrapped_lines.append(line)
    return '\n'.join(wrapped_lines)
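
For comparison, similar wrapping can be sketched with the standard library's `textwrap` module. This is a minimal sketch rather than a drop-in replacement: the name `wrap_text_stdlib` is made up for illustration, it checks for `None` instead of using `pd.isnull`, and its line breaks may differ slightly from `wrap_text` in edge cases such as words as long as the width.

```python
import textwrap


def wrap_text_stdlib(text, width=100):
    """Wrap each paragraph to `width` columns while keeping existing newlines."""
    if text is None:
        return text  # Leave missing values untouched
    wrapped = [
        # Do not split long words or hyphenated terms, similar to wrap_text
        textwrap.fill(paragraph, width=width,
                      break_long_words=False, break_on_hyphens=False)
        for paragraph in text.split("\n")
    ]
    return "\n".join(wrapped)


example = wrap_text_stdlib("aaa bbb ccc", width=7)
print(example)
```

Because paragraphs are wrapped one at a time, blank lines between paragraphs survive the wrapping.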

In [3]:
from grobid_client.grobid_client import GrobidClient
client = GrobidClient(grobid_server='http://localhost:8070')

GROBID server is up and running
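
Before starting a long processing run, it can help to check that the server is reachable. The sketch below queries GROBID's `/api/isalive` REST endpoint using only the standard library. The helper name `grobid_is_alive` and the timeout value are assumptions for illustration and are not part of the original notebook.

```python
from urllib.error import URLError
from urllib.request import urlopen


def grobid_is_alive(server="http://localhost:8070", timeout=5):
    """Return True if the GROBID server answers 'true' on /api/isalive."""
    url = server.rstrip("/") + "/api/isalive"
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8").strip().lower() == "true"
    except (URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not running"
        return False


# Usage (assumes the default local server address used in this notebook):
# if not grobid_is_alive():
#     raise RuntimeError("Start GROBID with ./gradlew run before processing")
```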


In [4]:
# Folder where the PDF articles are stored (in this case, the MICCAI articles)
process_file = '../miccai_articles' 

# Process the full text of the PDF articles using GROBID
client.process('processFulltextDocument', process_file, output="./03MICCAI_notebook_GROBID_processed_volumes", force=True)

### Renaming XML files by folder path
***

Renaming failed for two GROBID XML files; both XML files were corrected manually.

In [5]:
'''
File #1
FileNotFoundError: [Errno 2] No such file or directory: '03MICCAI_notebook_GROBID_processed_volumes/vol7/paper_44.grobid.tei.xml' 
-> '03MICCAI_notebook_GROBID_processed_volumes/vol7/Full_Image-Index_Remainder_Based_Single_Low-Dose_DR/CT_Self-supervised_Denoising.xml'

Solution: remove "/CT_Self-supervised_Denoising" from the XML file for paper_44 in vol7, save the updated version,
and re-run the code block.

File #2
FileNotFoundError: [Errno 2] No such file or directory: '03MICCAI_notebook_GROBID_processed_volumes/vol9/paper_49.grobid.tei.xml' 
-> '03MICCAI_notebook_GROBID_processed_volumes/vol9/A_Patient-Specific_Self-supervised_Model_for_Automatic_X-Ray/CT_Registration.xml'

Solution: remove "/CT_Registration" from the XML file for paper_49 in vol9, save the updated version,
and re-run the code block.
'''

In [6]:
#from pandas._libs import missing

'''
The following code block is used to extract the title of the papers from the XML files,
and rename the XML files based on the title of the papers.
'''

def find_title(element):
    """Recursively search for the title element in the XML structure."""
    if 'title' in element.tag.lower() and element.text:
        return element.text.strip()
    for child in element:
        title = find_title(child)
        if title:
            return title
    return None

def rename_xml_files_in_folder(folder_path):
    """Rename XML files based on their title tags."""
    for filename in os.listdir(folder_path):
        if not filename.endswith('.xml'):  # Skip non-XML files
            continue
        
        file_path = os.path.join(folder_path, filename)
        try:
            tree = et.parse(file_path)
            root = tree.getroot()
            paper_title = find_title(root)
            if paper_title:
                new_filename = paper_title.replace(" ", "_") + '.xml'
                new_file_path = os.path.join(folder_path, new_filename)
                os.rename(file_path, new_file_path)
                print(f"Renamed '{filename}' to '{new_filename}'")
            else:
                print(f"Title not found in '{filename}'. Skipping.")
        except et.ParseError as e:
            print(f"Error parsing '{filename}': {e}")

# Path to the processed volumes
path = '03MICCAI_notebook_GROBID_processed_volumes'

# Rename the XML files in the sub folders to the actual title of the article
for volume_number in range(1, 11):
    # Construct the path to the volume folder
    folder_path = f'{path}/vol{volume_number}'
    print(f"Processing '{folder_path}'...")

    # Rename the XML files in the folder
    rename_xml_files_in_folder(folder_path)

Processing '03MICCAI_notebook_GROBID_processed_volumes/vol1'...
Renamed 'paper_63.grobid.tei.xml' to 'Masked_Frequency_Consistency_for_Domain-Adaptive_Semantic_Segmentation_of_Laparoscopic_Images.xml'
Renamed 'paper_12.grobid.tei.xml' to 'Additional_Positive_Enables_Better_Representation_Learning_for_Medical_Images.xml'
Renamed 'paper_71.grobid.tei.xml' to 'Black-box_Domain_Adaptative_Cell_Segmentation_via_Multi-source_Distillation.xml'
Renamed 'paper_64.grobid.tei.xml' to 'Pick_the_Best_Pre-trained_Model:_Towards_Transferability_Estimation_for_Medical_Image_Segmentation.xml'
Renamed 'paper_15.grobid.tei.xml' to 'Automatic_Retrieval_of_Corresponding_US_Views_in_Longitudinal_Examinations.xml'
Renamed 'paper_65.grobid.tei.xml' to 'Source-Free_Domain_Adaptive_Fundus_Image_Segmentation_with_Class-Balanced_Mean_Teacher.xml'
Renamed 'paper_70.grobid.tei.xml' to 'Cross-Dataset_Adaptation_for_Instrument_Classification_in_Cataract_Surgery_Videos.xml'
Renamed 'paper_14.grobid.tei.xml' to '3D_Art

FileNotFoundError: [Errno 2] No such file or directory: '03MICCAI_notebook_GROBID_processed_volumes/vol7/paper_44.grobid.tei.xml' -> '03MICCAI_notebook_GROBID_processed_volumes/vol7/Full_Image-Index_Remainder_Based_Single_Low-Dose_DR/CT_Self-supervised_Denoising.xml'
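
The error occurs because the title contains a `/`, which `os.rename` interprets as a directory separator. As an alternative to editing the XML by hand, the title could be sanitized before it is used as a file name. The sketch below is one possible approach; the helper name `title_to_filename` and the choice of `-` as the replacement character are assumptions, not part of the original notebook.

```python
import re


def title_to_filename(title, extension=".xml"):
    """Build a file name from a paper title without path separators."""
    # Replace forward and backward slashes so the title stays a single path component
    cleaned = re.sub(r"[\\/]", "-", title.strip())
    # Replace runs of whitespace with one underscore, like the notebook does for spaces
    cleaned = re.sub(r"\s+", "_", cleaned)
    return cleaned + extension


print(title_to_filename("A Patient-Specific Self-supervised Model for Automatic X-Ray/CT Registration"))
```

With this helper, `new_filename = title_to_filename(paper_title)` would replace the `paper_title.replace(" ", "_") + '.xml'` expression in `rename_xml_files_in_folder`.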

In [7]:
'''
Two XML files are missing from the processed volumes:
- vol6: paper_15.grobid.tei.xml
- vol7: paper_13.grobid.tei.xml

They were located by re-running client.process() with process_file pointing at the vol6 and vol7 folders
and storing the output in two separate folders named './06' and './07'.
The missing XML files were then identified there, renamed to the correct titles,
and moved to the correct folder in the processed volumes folder (03MICCAI_notebook_GROBID_processed_volumes).
'''
# For finding missing XML files in the processed volumes:
# vol6: paper_15.grobid.tei.xml
# process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/project_submission/00_project_notebook/miccai_papers/vol6' 
# client.process('processFulltextDocument', process_file, output="./06", force=True)

# vol7: paper_13.grobid.tei.xml
# process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/project_submission/00_project_notebook/miccai_papers/vol7' 
# client.process('processFulltextDocument', process_file, output="./07", force=True)


# Missing XML files located via the client.process() runs above
missing_files = {'06/paper_15.grobid.tei.xml', '07/paper_13.grobid.tei.xml'}

# Rename the missing XML files to the actual title of the article
for file_path in missing_files:
    tree = et.parse(file_path)
    root = tree.getroot()
    paper_title = find_title(root)
    if paper_title:
        new_filename = paper_title.replace(" ", "_") + '.xml'
        new_file_path = os.path.join(os.path.dirname(file_path), new_filename)
        os.rename(file_path, new_file_path)
        print(f"Renamed '{file_path}' to '{new_filename}'")
    else:
        print(f"Title not found in '{file_path}'. Skipping.")

FileNotFoundError: [Errno 2] No such file or directory: '07/paper_13.grobid.tei.xml'
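
Missing conversions can also be detected programmatically before any renaming takes place. The sketch below compares PDF file names with GROBID output names. It assumes that each `paper_N.pdf` yields a `paper_N.grobid.tei.xml` (the naming visible in the rename log), and the helper name `find_unprocessed_pdfs` and the demo folders are hypothetical.

```python
from pathlib import Path
import tempfile


def find_unprocessed_pdfs(pdf_dir, xml_dir, xml_suffix=".grobid.tei.xml"):
    """Return the stems of PDFs in pdf_dir that have no XML output in xml_dir."""
    pdf_stems = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    xml_stems = {p.name[: -len(xml_suffix)] for p in Path(xml_dir).glob("*" + xml_suffix)}
    return sorted(pdf_stems - xml_stems)


# Demo on a throwaway directory: paper_13.pdf has no matching XML file
with tempfile.TemporaryDirectory() as tmp:
    pdf_dir, xml_dir = Path(tmp, "pdfs"), Path(tmp, "xml")
    pdf_dir.mkdir()
    xml_dir.mkdir()
    (pdf_dir / "paper_13.pdf").touch()
    (pdf_dir / "paper_14.pdf").touch()
    (xml_dir / "paper_14.grobid.tei.xml").touch()
    missing = find_unprocessed_pdfs(pdf_dir, xml_dir)

print(missing)
```

The check only works while the XML files still carry their original `paper_N` names, so it would have to run before the renaming step.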

#### **After the first run of the notebook, the XML files were renamed manually by**:
***

1. locating the text in the abstract tag of the XML files named "Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml".
2. searching for matching text in the original PDF files stored in the "miccai_papers" directory and its subdirectories.
    - if the text is found, the title of the paper is copied from the PDF file and used to rename the XML file.
3. updating the title in the XML file by:
    - locating the title tag in the XML file,
    - pasting in the title of the paper from the PDF file, and
    - saving the updated XML file.
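
Step 3 could also be scripted instead of edited by hand. The following is a minimal sketch under the assumption that the main title sits in `titleStmt/title` inside the TEI header. The helper name `set_main_title` and the inline sample document are illustrative only and were not part of the original workflow.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_URI}


def set_main_title(xml_text, new_title):
    """Return the TEI document with the titleStmt title replaced by new_title."""
    # Keep TEI as the default namespace when the tree is serialized again
    ET.register_namespace("", TEI_URI)
    root = ET.fromstring(xml_text)
    title = root.find(".//tei:titleStmt/tei:title", NS)
    if title is None:
        raise ValueError("no <titleStmt>/<title> element found")
    title.text = new_title
    return ET.tostring(root, encoding="unicode")


sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt>'
    '<title level="a" type="main">Medical Image Computing and Computer Assisted '
    'Intervention – MICCAI 2023</title>'
    '</titleStmt></fileDesc></teiHeader></TEI>'
)
updated = set_main_title(
    sample,
    "Pelphix: Surgical Phase Recognition from X-Ray Images in Percutaneous Pelvic Fixation",
)
print(updated)
```

The correct title still has to come from the PDF, so this only automates the editing and saving part of the procedure.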

**Renaming files**:

- **File #1**: Dual Conditioned Diffusion Models for Out-of-Distribution Detection: Application to Fetal Ultrasound Videos
    - **original**: path + '/vol1/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
    - **renamed to**: '03MICCAI_notebook_GROBID_processed_volumes/vol1/Dual_Conditioned_Diffusion_Models_for_Out-of-Distribution_Detection:_Application_to_Fetal_Ultrasound_Videos.xml'

- **File #2**: COLosSAL: A Benchmark for Cold-Start Active Learning for 3D Medical Image Segmentation
    - **original**: path + '/vol2/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
    - **renamed to**: '03MICCAI_notebook_GROBID_processed_volumes/vol2/COLosSAL:_A_Benchmark_for_Cold-Start_Active_Learning_for_3D_Medical_Image_Segmentation.xml'

- **File #3**: Learnable Cross-modal Knowledge Distillation for Multi-modal Learning with Missing Modality
    - **original**: path + '/vol4/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
    - **renamed to**: (details not provided)

- **File #4**: (...)

- **File #6**: Pelphix: Surgical Phase Recognition from X-Ray Images in Percutaneous Pelvic Fixation
    - **original**: path + '/vol9/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
    - **renamed to**: '03MICCAI_notebook_GROBID_processed_volumes/vol9/Pelphix:_Surgical_Phase_Recognition_from_X-Ray_Images_in_Percutaneous_Pelvic_Fixation.xml'

In [8]:
# Path to the processed volumes folder
path = "03MICCAI_notebook_GROBID_processed_volumes"

In [9]:
# FIRST RUN - RENAMING THE XML FILES BASED ON THE TITLE TAGS:
########################################################################################

# Uncomment exactly one file_path below per run; otherwise the cell reuses file_path from an earlier cell
# file_path = path + '/vol1/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol2/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol4/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol6/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol7/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol9/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'

# Load and parse the XML file
tree = et.parse(file_path)
root = tree.getroot()

# Since XML namespaces can complicate direct tag access, we find the title tag dynamically.
# This approach is based on the assumption that titles are relatively unique in structure.

# Attempt to extract the paper title. This might need adjustments based on the actual structure.
title = None
for elem in root.iter():
    if 'title' in elem.tag.lower():
        title = elem.text
        break

title_clean = title.strip().replace(" ", "_") if title else "Untitled_Document"

# Attempt a more generic search for the title, considering common patterns in scholarly articles
# We'll look for title elements that might be nested within other elements (like "titleStmt" or "fileDesc" in TEI format)

def find_title(element):
    """
    Recursively search for the title element in the XML structure.
    """
    if 'title' in element.tag.lower() and element.text:
        return element.text.strip()
    for child in element:
        title = find_title(child)
        if title:
            return title
    return None

# Attempt to find the title using the recursive search
paper_title = find_title(root)
paper_title_clean = paper_title.replace(" ", "_") if paper_title else "Untitled_Document"


# Define the new file path with the clean title
new_file_path = os.path.join(os.path.dirname(file_path), f"{paper_title_clean}.xml")

# Rename the file
os.rename(file_path, new_file_path)

new_file_path

FileNotFoundError: [Errno 2] No such file or directory: '07/paper_13.grobid.tei.xml'

In [10]:
# SECOND RUN - RENAMING THE MISSING XML FILES BY SAME METHOD:
########################################################################################

'''
Two unique titles were still missing, one each in vol6 and vol7. In the previous code block, both files were named
"Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml" instead of their actual titles.

The missing files are:
- vol6: A Multi-task Method for Immunofixation Electrophoresis Image Classification (paper_15)
- vol7: DMCVR: Morphology-Guided Diffusion Model for 3D Cardiac Volume Reconstruction (paper_13)

The files are now renamed based on their actual titles.
'''

# 2 unique titles missing, 1 in vol6 and vol7 each:
# vol6: A Multi-task Method for Immunofixation Electrophoresis Image Classification (paper_15)
#file_path = '06/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'

# vol7: DMCVR: Morphology-Guided Diffusion Model for 3D Cardiac Volume Reconstruction (paper_13)
# file_path = '07/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'

########################################################################################

# Load and parse the XML file
tree = et.parse(file_path)
root = tree.getroot()

# Since XML namespaces can complicate direct tag access, we find the title tag dynamically.
# This approach is based on the assumption that titles are relatively unique in structure.

# Attempt to extract the paper title. This might need adjustments based on the actual structure.
title = None
for elem in root.iter():
    if 'title' in elem.tag.lower():
        title = elem.text
        break

title_clean = title.strip().replace(" ", "_") if title else "Untitled_Document"

# Attempt a more generic search for the title, considering common patterns in scholarly articles
# We'll look for title elements that might be nested within other elements (like "titleStmt" or "fileDesc" in TEI format)

def find_title(element):
    """
    Recursively search for the title element in the XML structure.
    """
    if 'title' in element.tag.lower() and element.text:
        return element.text.strip()
    for child in element:
        title = find_title(child)
        if title:
            return title
    return None

# Attempt to find the title using the recursive search
paper_title = find_title(root)
paper_title_clean = paper_title.replace(" ", "_") if paper_title else "Untitled_Document"


# Define the new file path with the clean title
new_file_path = os.path.join(os.path.dirname(file_path), f"{paper_title_clean}.xml")

# Rename the file
os.rename(file_path, new_file_path)

new_file_path

FileNotFoundError: [Errno 2] No such file or directory: '07/paper_13.grobid.tei.xml'

In [11]:
from lxml import etree 

# Function to parse the XML files and extract the headers
def parse_xml_and_extract_headers(file_path):
    tree = etree.parse(file_path)
    root = tree.getroot()
    ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

    # Extract the paper title by XPath in the XML's structure
    paper_title_element = root.find('.//tei:title', ns)

    # If the title is not found, set a default value
    paper_title = paper_title_element.text if paper_title_element is not None else "No Title Found"

    # Extract all headers in the document
    headers = root.xpath('//tei:head', namespaces=ns)
    print(f"Found {len(headers)} headers in '{paper_title}'")
    
    data = [] # List to store the extracted data
    for header in headers:
        # Join all text nodes inside the <p> tags under the header's parent element, including nested elements
        text_content = ''.join(header.getparent().xpath('.//tei:p//text()', namespaces=ns))

        # Organize the extracted data into a dictionary of key-value pairs
        data.append({
            'Paper Title': paper_title,
            'Header Number': header.get('n'),
            'Header Title': header.text,
            'Text': text_content
        })

    # Create a DataFrame from the extracted data
    df = pd.DataFrame(data, columns=['Paper Title', 'Header Number', 'Header Title', 'Text'])
    return df

# Extract the headers from every XML file in a folder and combine them into a single DataFrame
def process_xml_folder(folder_path):
    all_data_frames = []
    for file_name in os.listdir(folder_path):
        if file_name.endswith(".xml"):
            file_path = os.path.join(folder_path, file_name)
            df = parse_xml_and_extract_headers(file_path)
            all_data_frames.append(df)

    # Concatenate all DataFrames into a single one
    if all_data_frames:
        final_df = pd.concat(all_data_frames, ignore_index=True)
    else:
        final_df = pd.DataFrame()

    return final_df

# Folder pattern where the XML files are stored (folder_path is reassigned per volume in the loop below)
folder_path = '03MICCAI_notebook_GROBID_processed_volumes/*'

# Create a folder to store the DataFrames
output_folder = '03MICCAI_notebook_GROBID_dataframes'
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Process each volume folder and save a DataFrame with all headers
for volume_number in range(1, 11):
    folder_path = f'{path}/vol{volume_number}'
    print(f"Creating a DataFrame with Extracted Data From'{folder_path}'...")
    df_headers = process_xml_folder(folder_path)
    df_headers.to_csv(f"03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol{volume_number}_headers.csv", index=False)

Creating a DataFrame with Extracted Data From'03MICCAI_notebook_GROBID_processed_volumes/vol1'...
Found 13 headers in 'AMAE: Adaptation of Pre-trained Masked Autoencoder for Dual-Distribution Anomaly Detection in Chest X-Rays'
Found 19 headers in 'Unsupervised Domain Adaptation for Anatomical Landmark Detection'
Found 13 headers in 'CT-Guided, Unsupervised Super-Resolution Reconstruction of Single 3D Magnetic Resonance Image'
Found 18 headers in 'Multi-scale Cross-restoration Framework for Electrocardiogram Anomaly Detection'
Found 12 headers in 'Multi-modal Variational Autoencoders for Normative Modelling Across Multiple Imaging Modalities'
Found 18 headers in 'Dense Transformer based Enhanced Coding Network for Unsupervised Metal Artifact Reduction'
Found 22 headers in 'MedIM: Boost Medical Image Representation via Radiology Report-Guided Masking'
Found 17 headers in 'Unsupervised Domain Transfer with Conditional Invertible Neural Networks'
Found 19 headers in 'Anatomy-Driven Patholo
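
To make the extraction logic easier to follow, here is a self-contained sketch on a tiny inline TEI snippet. It uses the standard library's `xml.etree.ElementTree` instead of `lxml`, and it only looks at `<head>` elements that are direct children of a `<div>`, which is an assumption that differs from the notebook's `//tei:head` query. The function name `extract_sections` and the sample XML are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_sections(xml_text):
    """Pair each section header with the paragraph text of its <div>."""
    root = ET.fromstring(xml_text)
    rows = []
    for div in root.iterfind(".//tei:div", NS):
        head = div.find("tei:head", NS)
        if head is None:
            continue  # Skip divisions without a header
        # itertext() also collects text nested in child elements such as <ref>
        paragraphs = ["".join(p.itertext()).strip() for p in div.iterfind("tei:p", NS)]
        rows.append({
            "header_no": head.get("n"),
            "header_title": head.text,
            "text": " ".join(paragraphs),
        })
    return rows


sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div><head n="1">Introduction</head>'
    '<p>First <ref>[1]</ref> paragraph.</p><p>Second.</p></div>'
    '<div><head n="2">Method</head><p>Details.</p></div>'
    '</body></text></TEI>'
)
sections = extract_sections(sample)
for row in sections:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as the `data` list in `parse_xml_and_extract_headers`.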

In [13]:
"""
The following code block is used to clean the DataFrames created from the XML files for each volume.
The cleaning process involves removing rows where both 'Header Title' and 'Text' are NaN or just 'Text' is NaN.

The cleaned DataFrames are saved to CSV files in a new folder named: 
- '03MICCAI_notebook_cleaned_dataframes'.

The DataFrames are loaded from the: 
- '03MICCAI_notebook_GROBID_dataframes' folder and saved to the '03MICCAI_notebook_cleaned_dataframes' folder.
"""

def process_xml_folder(folder_path):
    df = pd.read_csv(folder_path)
    return df

def clean_dataframe(df):
    # Remove rows where both 'Header Title' and 'Text' are NaN or just 'Text' is NaN
    df_cleaned = df.dropna(subset=['Header Title', 'Text'], how='all')
    df_cleaned = df_cleaned.dropna(subset=['Text'], how='any')
    return df_cleaned

# Dictionary to store cleaned DataFrames
cleaned_dataframes = {}

# Base path where all processed volumes are stored
base_path = '03MICCAI_notebook_GROBID_dataframes'


output_folder = '03MICCAI_notebook_cleaned_dataframes'
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Example input: 03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol1_headers.csv
# Clean the per-volume DataFrames created from the XML files
for volume_number in range(1, 11):
    folder_path = f'{base_path}/03MICCAI_notebook_df_vol{volume_number}_headers.csv'
    print(f"Cleaning DataFrame for '{folder_path}'...")
    # Load the per-volume headers CSV into a DataFrame
    df_headers = process_xml_folder(folder_path)
    # Clean the DataFrame
    df_cleaned = clean_dataframe(df_headers)
    # Save the cleaned DataFrame to a CSV file in the output folder
    df_cleaned.to_csv(f"03MICCAI_notebook_cleaned_dataframes/03MICCAI_notebook_df_vol{volume_number}_cleaned.csv", index=False)

Cleaning DataFrame for '03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol1_headers.csv'...
Cleaning DataFrame for '03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol2_headers.csv'...
Cleaning DataFrame for '03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol3_headers.csv'...
Cleaning DataFrame for '03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol4_headers.csv'...
Cleaning DataFrame for '03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol5_headers.csv'...
Cleaning DataFrame for '03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol6_headers.csv'...
Cleaning DataFrame for '03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol7_headers.csv'...
Cleaning DataFrame for '03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol8_headers.csv'...
Cleaning DataFrame for '03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_vol9_headers.csv'...
Cleaning DataFrame for '03MICCAI_notebook_GROBID_dataframes/03MICCAI_notebook_df_v

In [14]:
# Check the number of unique paper titles in the cleaned DataFrames for volume 10
len(df_cleaned['Paper Title'].unique())

74

In [15]:
# Read in the folder with the cleaned_dataframes
folder_path = '03MICCAI_notebook_cleaned_dataframes'

# Verify unique title counts in the individual and saved dataframes before combining
total_unique = 0
for volume_number in range(1, 11):
    file_path = f'{folder_path}/03MICCAI_notebook_df_vol{volume_number}_cleaned.csv'
    df = pd.read_csv(file_path)
    unique_in_df = len(df['Paper Title'].unique())
    print(f"Unique titles in vol{volume_number}: {unique_in_df}")
    total_unique += unique_in_df

print(f"Sum of unique titles from individual DataFrames: {total_unique}") 

Unique titles in vol1: 73
Unique titles in vol2: 73
Unique titles in vol3: 72
Unique titles in vol4: 75
Unique titles in vol5: 76
Unique titles in vol6: 76
Unique titles in vol7: 74
Unique titles in vol8: 65
Unique titles in vol9: 70
Unique titles in vol10: 74
Sum of unique titles from individual DataFrames: 728


In [None]:
"""
Note: two unique titles were missing, one from vol6 and one from vol7.
To solve this, I located the missing papers manually by processing the volume subfolders
with the client.process() function, which produced:

- vol6: paper_15.grobid.tei.xml
- vol7: paper_13.grobid.tei.xml

The missing XML files were renamed based on the actual title of the papers and moved to the correct folder in the processed volumes folder.

Second, I manually corrected titles in the per-volume dataframes, because some of the XML files were named
"Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml" instead of the actual title of the papers. These papers were lost
when the dataframes were created and concatenated into a single dataframe.

The two recovered papers are:
- vol6: A Multi-task Method for Immunofixation Electrophoresis Image Classification
- vol7: DMCVR: Morphology-Guided Diffusion Model for 3D Cardiac Volume Reconstruction

The final results are stored in the '03MICCAI_notebook_cleaned_dataframes' folder as CSV files in the format:
- '03MICCAI_notebook_df_vol{volume_number}_cleaned.csv'.

You can access the final results here:
- the '00MICCAI_total_outputs' > '03MICCAI_all_outputs' folder.  
"""

# 2 unique titles missing, 1 in vol6 and vol7 each:
# vol6: A Multi-task Method for Immunofixation Electrophoresis Image Classification (paper_15)
# vol7: DMCVR: Morphology-Guided Diffusion Model for 3D Cardiac Volume Reconstruction

# 2 titles are named "Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023" in the dataframes
# vol6: AR2T:_Adaptive_Robust_Regression_Tree_for_Medical_Image_Segmentation
# vol7: Virtual Heart Models Help Elucidate the Role of Border Zone in Sustained Monomorphic Ventricular Tachycardia
# rename the files in the folder and re-run the code block

In [16]:
# Merge all cleaned DataFrames into a single DataFrame
all_cleaned_dataframes = []
for volume_number in range(1, 11):
    file_path = f'{folder_path}/03MICCAI_notebook_df_vol{volume_number}_cleaned.csv'
    df = pd.read_csv(file_path)
    all_cleaned_dataframes.append(df)

# Convert the list of DataFrames into a single DataFrame
df_all_cleaned = pd.concat(all_cleaned_dataframes, ignore_index=True)

In [17]:
print(df_all_cleaned['Paper Title'].nunique()) # 730 unique titles expected after the manual fixes; the run recorded below printed 723
df_all_cleaned.head()

# Save the combined DataFrame to a CSV file
# df_all_cleaned.to_csv("03MICCAI_notebook_df_paper_extractions_all_cleaned_.csv", index=False)

723


|   | Paper Title | Header Number | Header Title | Text |
|---|-------------|---------------|--------------|------|
| 0 | AMAE: Adaptation of Pre-trained Masked Autoenc... | 1.0 | Introduction | To reduce radiologists' reading burden and mak... |
| 1 | AMAE: Adaptation of Pre-trained Masked Autoenc... | 2.0 | Method | Notation. We first formally define the problem... |
| 2 | AMAE: Adaptation of Pre-trained Masked Autoenc... | 2.1 | Stage 1-Proxy Task to Detect Synthetic Anomalies | AMAE starts the first training stage using onl... |
| 3 | AMAE: Adaptation of Pre-trained Masked Autoenc... | 2.2 | Stage 2-MAE Inter-Discrepancy Adaptation | The proposed MAE adaptation scheme is inspired... |
| 4 | AMAE: Adaptation of Pre-trained Masked Autoenc... | 3.0 | Experiments | Datasets. We evaluated our method on three pub... |
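
A combined unique count can be lower than the sum of the per-volume counts (728 above) whenever the same title string occurs in more than one volume, for example a placeholder title shared by several misnamed files. The toy sketch below uses invented data to show the effect and how shared titles could be listed; it is an illustration, not a diagnosis of the actual dataset.

```python
import pandas as pd

# Invented per-volume frames; "Placeholder Title" appears in both volumes
frames = {
    1: pd.DataFrame({"Paper Title": ["Paper A", "Placeholder Title"]}),
    2: pd.DataFrame({"Paper Title": ["Paper B", "Placeholder Title"]}),
}

# Sum of the per-volume unique counts
sum_of_unique = sum(df["Paper Title"].nunique() for df in frames.values())

# Unique count after concatenation, with the volume kept as a column
combined = pd.concat(
    [df.assign(volume=vol) for vol, df in frames.items()], ignore_index=True
)
combined_unique = combined["Paper Title"].nunique()

# Titles that occur in more than one volume explain the difference
volumes_per_title = combined.groupby("Paper Title")["volume"].nunique()
shared_titles = volumes_per_title[volumes_per_title > 1].index.tolist()

print(sum_of_unique, combined_unique, shared_titles)
```

Keeping a `volume` column while concatenating makes it possible to trace every title back to its source folder.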


In [18]:
df_all_cleaned = df_all_cleaned.fillna(0)

# Lowercase all text in the 'Text' column and wrap it at 80 characters
df_all_cleaned['Text'] = df_all_cleaned['Text'].str.lower()
df_all_cleaned['Text'] = df_all_cleaned['Text'].apply(wrap_text, width=80)

# Remove volume suffixes such as ' (vol3)' from the paper titles
df_all_cleaned['Paper Title'] = df_all_cleaned['Paper Title'].str.replace(r'\s*\(vol\d+\)', '', regex=True)

# Rename the columns to snake_case
df_all_cleaned.rename(columns={'Paper Title': 'title', 'Header Number':'header_no', 'Header Title': 'header_title', 'Text':'text', 'Volume': 'volume'}, inplace=True)

# Save the cleaned DataFrame to a CSV file
df_all_cleaned.to_csv('03MICCAI_notebook_df_paper_extractions_all_cleaned.csv', index=False)

In [19]:
# Search the text column for 'cancer', 'tumor', or 'tumour' (case-insensitive)
cancer_papers_mask = df_all_cleaned['text'].str.contains('cancer|tumor|tumour', case=False, na=False)
papers_with_cancer = df_all_cleaned[cancer_papers_mask]

# Get the unique titles of papers that mention a cancer-related term
unique_titles_with_cancer = papers_with_cancer['title'].unique()

# Extract all headers and their related text for those papers
extracted_info = df_all_cleaned[df_all_cleaned['title'].isin(unique_titles_with_cancer)]

# Reset index of the resulting DataFrame
extracted_info = extracted_info.reset_index(drop=True)

unique_paper_titles_with_cancer = extracted_info['title'].unique()
print(len(unique_paper_titles_with_cancer)) # 263 cancer-related papers expected; the run recorded below printed 262

# Save the extracted information to a CSV file for further analysis or processing.
# The manually processed CSV file is stored in the '00MICCAI_total_outputs' > '03MICCAI_all_outputs' folder.
# extracted_info.to_csv("03MICCAI_notebook_df_paper_extractions_cancer.csv", index=False)

# Display the first few rows of the resulting DataFrame
extracted_info.head() 

262


Unnamed: 0,title,header_no,header_title,text
0,Anatomy-Driven Pathology Detection on Chest X-...,1.0,Introduction,chest radiographs (chest x-rays) represent the...
1,Anatomy-Driven Pathology Detection on Chest X-...,2.0,Related Work,weakly supervised pathology detection. due to ...
2,Anatomy-Driven Pathology Detection on Chest X-...,3.1,Model,figure 1 provides an overview of our method. g...
3,Anatomy-Driven Pathology Detection on Chest X-...,3.2,Inference,"during inference, the trained model predicts a..."
4,Anatomy-Driven Pathology Detection on Chest X-...,3.3,Training,the anatomical region detector is trained usin...
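
The selection in this cell works at paper level: a single matching section is enough to keep every section of that paper. The sketch below illustrates this on a toy DataFrame. The titles and texts are invented for illustration; only the column names and the search pattern come from the notebook.

```python
import pandas as pd

# Toy data (hypothetical); columns mirror the notebook's 'title', 'header_title', 'text'.
toy = pd.DataFrame({
    "title": ["Paper A", "Paper A", "Paper B", "Paper C", "Paper C"],
    "header_title": ["Introduction", "Method", "Introduction", "Introduction", "Results"],
    "text": [
        "we study chest x-rays",
        "evaluated on a lung Tumour cohort",
        "graph matching for registration",
        "breast cancer screening",
        "accuracy improves",
    ],
})

# Row-level match, case-insensitive, same pattern as the notebook.
row_mask = toy["text"].str.contains("cancer|tumor|tumour", case=False, na=False)

# Expand from matching rows to every row of the matching papers.
titles_with_match = toy.loc[row_mask, "title"].unique()
paper_level = toy[toy["title"].isin(titles_with_match)].reset_index(drop=True)

print("rows matching directly:", int(row_mask.sum()))
print("rows kept at paper level:", len(paper_level))
```

Here two rows match directly, but four rows are kept because both sections of each matching paper are retained.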


In [20]:
categories = {
    'age': ['age', 'young', 'old'],
    'gender': ['gender', 'sex', 'women', 'woman', 'female', 'male'],
    'ethnicity': ['ethnicity', 'ethnicities', 'race', 'white patients', 'black patients'],
    'location_info': ['geolocation', 'geographical', 'geographic', 'country', 'countries', 
                    'city', 'cities', 'hospital', 'hospitals', 'clinic', 'clinics', 'continent',
                    'province', 'state', 'region', 'town', 'village', 'area', 'district'],
    'patients': ['patient', 'patients'],
    'dataset_info': ['dataset', 'datasets', 'data set', 'data sets', 'publicly', 'public', 'private', 'open access', 'open-access'],
    'bias_info': ['bias', 'biases', 'fairness'],
}

# Flatten the list of all keywords excluding 'patients' to avoid redundancy
all_keywords = sum([kw for cat, kw in categories.items() if cat != 'patients'], [])

# Filter papers that mention 'patient' or 'patients'
scope_mask = extracted_info['text'].str.contains('patient|patients', case=False, na=False)
papers_with_patients = extracted_info[scope_mask]

# Prepare a list to collect paper info dictionaries
papers_info_list = []

# Iterate over unique titles in the filtered DataFrame
for title in papers_with_patients['title'].unique():
    paper_info = papers_with_patients[papers_with_patients['title'] == title]
    # Initialize a dictionary for the current paper with zeros for all keywords
    paper_keywords = dict.fromkeys(all_keywords, 0)
    paper_keywords['title'] = title
    # Check for each keyword in the text of the paper (case-insensitive substring match)
    for keyword in all_keywords:
        if paper_info['text'].str.contains(keyword, case=False, na=False).any():
            paper_keywords[keyword] = 1
    # Collect the keyword matches for the current paper
    papers_info_list.append(paper_keywords)

# Create a DataFrame from the list of dictionaries
keywords_per_paper = pd.DataFrame(papers_info_list)

# Save the patient-related rows to a CSV file for further analysis
# papers_with_patients.to_csv("03MICCAI_notebook_df_paper_extractions_patients_and_cancer.csv", index=False)

# The count is 156 if the extractions of missing papers have not been processed manually in earlier cells; otherwise it is 155
print('Number of unique titles for papers containing the keyword <patient/patients>:', papers_with_patients['title'].nunique())
papers_with_patients.head()

Number of unique titles for papers containing the keyword <patient/patients>: 156


Unnamed: 0,title,header_no,header_title,text
5,Anatomy-Driven Pathology Detection on Chest X-...,3.4,Dataset,training dataset. we train on the chest imagen...
18,Self-supervised Learning for Physiologically-B...,2.4,Dataset,the dataset is composed of 23 oncological pati...
20,Self-supervised Learning for Physiologically-B...,0.0,(Color figure online),the most important design choice is the select...
21,Self-supervised Learning for Physiologically-B...,4.0,Discussion,even though the choice of the final activation...
32,AME-CAM: Attentive Multiple-Exit CAM for Weakl...,5.0,Conclusion,"in this work, we propose a brain tumor segment..."
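
The keyword check in this cell uses substring matching, so short keywords such as `age` or `male` can also match inside longer words such as "image" or "female". The sketch below compares substring matching with whole-word matching on toy sentences. The helper names `contains_substring` and `contains_whole_word` are hypothetical and not part of the notebook, and whole-word matching is shown as one possible alternative, not as the method used here.

```python
import re
import pandas as pd

# Toy sentences (hypothetical) chosen to expose substring false positives.
texts = pd.Series([
    "each image was resized to 256 pixels",    # "age" occurs only inside "image"
    "patients with a median age of 61 years",  # "age" occurs as a whole word
    "34 female participants were enrolled",    # "male" occurs only inside "female"
])

def contains_substring(series, keyword):
    """Case-insensitive substring match, comparable to the notebook's check."""
    return series.str.contains(re.escape(keyword), case=False, na=False)

def contains_whole_word(series, keyword):
    """Case-insensitive match that requires word boundaries around the keyword."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return series.str.contains(pattern, case=False, na=False)

for kw in ["age", "male"]:
    print(kw, "substring :", contains_substring(texts, kw).tolist())
    print(kw, "whole word:", contains_whole_word(texts, kw).tolist())
```

Whole-word matching would not match plural forms unless they are listed separately, which the `categories` dictionary already does for several keywords (for example `hospital` and `hospitals`).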


In [21]:
"""
For the final step, we randomly select 100 unique papers from the DataFrame of papers that mention patients.
The selected papers and their related rows are saved to
'03MICCAI_notebook_100_randomly_selected_papers.csv' for annotation and further analysis.
"""

# File-name prefix (notebook name) used when saving the selected papers
notebook_name = '03MICCAI_notebook_'

# Check if the number of unique titles is at least 100
unique_titles = papers_with_patients['title'].nunique()
if unique_titles < 100:
    print(f"Warning: Only {unique_titles} unique papers found, less than 100.")

# Randomly select up to 100 unique titles (fixed random_state for reproducibility)
selected_titles = papers_with_patients['title'].drop_duplicates().sample(n=min(100, unique_titles), random_state=32)

# Filter the original DataFrame to include only the selected titles
selected_papers_df = papers_with_patients[papers_with_patients['title'].isin(selected_titles)]

# Save selected_papers_df DataFrame with 100 randomly selected papers and their related rows
#selected_papers_df.to_csv(notebook_name + '100_randomly_selected_papers.csv')

# Print the number of unique titles in the selected DataFrame
print(f"Number of unique titles in the selected DataFrame: {selected_papers_df['title'].nunique()}")

Number of unique titles in the selected DataFrame: 100
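
The sampling step draws titles rather than rows, so every section of a selected paper stays together, and the fixed `random_state` makes the draw repeatable. A minimal sketch on toy data follows. The helper `sample_papers` and the toy DataFrame are hypothetical; only the pattern of `drop_duplicates`, `sample`, and `isin` is taken from the cell above.

```python
import pandas as pd

# Toy data (hypothetical): 10 papers with 3 section rows each.
toy = pd.DataFrame({
    "title": [f"Paper {i}" for i in range(10) for _ in range(3)],
    "text": ["placeholder text"] * 30,
})

def sample_papers(df, n, seed):
    """Sample up to n unique titles, then keep every row belonging to them."""
    unique_titles = df["title"].drop_duplicates()
    chosen = unique_titles.sample(n=min(n, len(unique_titles)), random_state=seed)
    return df[df["title"].isin(chosen)]

first = sample_papers(toy, n=4, seed=32)
second = sample_papers(toy, n=4, seed=32)

print("papers selected:", first["title"].nunique())
print("rows selected:", len(first))
print("same selection on rerun:", first.equals(second))
```

Capping `n` at the number of available titles mirrors the `min(100, unique_titles)` guard in the notebook, so the call does not fail when fewer papers are available than requested.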


***
***




### Manually processed papers previously stored in '00MICCAI_total_outputs'

In [4]:
import pandas as pd

path = '00MICCAI_total_outputs/03MICCAI_all_outputs/'

all_papers = pd.read_csv(path + '03MICCAI_notebook_df_paper_extractions_all_cleaned.csv')
print(f'Total number of unique papers: {all_papers["title"].nunique()}') # 730

cancer_papers = pd.read_csv(path + '03MICCAI_notebook_df_paper_extractions_cancer.csv')
print(f'Total number of unique papers mentioning "cancer": {cancer_papers["title"].nunique()}') # 263

patient_cancer_papers = pd.read_csv(path + '03MICCAI_notebook_df_paper_extractions_patients_and_cancer.csv')
print(f'Total number of unique papers mentioning "patient/patients": {patient_cancer_papers["title"].nunique()}') # 155

rand_selected_papers = pd.read_csv(path + '03MICCAI_notebook_100_randomly_selected_papers.csv')
print(f'Total number of unique papers in the randomly selected 100 papers: {rand_selected_papers["title"].nunique()}') # 100

Total number of unique papers: 730
Total number of unique papers mentioning "cancer": 263
Total number of unique papers mentioning "patient/patients": 155
Total number of unique papers in the randomly selected 100 papers: 100
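
The counts above (730, 263, 155, 100) suggest that each file is a subset of the previous one. The sketch below shows one way to check that nesting by title. It runs on toy DataFrames, the helper `titles_are_subset` is hypothetical, and whether the nesting actually holds for the manually processed CSVs is an assumption that has not been verified here.

```python
import pandas as pd

def titles_are_subset(child, parent):
    """Return True if every title in `child` also appears in `parent`."""
    return set(child["title"]).issubset(set(parent["title"]))

# Toy stand-ins (hypothetical) for the four CSVs loaded in the cell above.
all_toy = pd.DataFrame({"title": ["A", "B", "C", "D", "E"]})
cancer_toy = pd.DataFrame({"title": ["A", "B", "C"]})
patient_toy = pd.DataFrame({"title": ["A", "B"]})
selected_toy = pd.DataFrame({"title": ["A", "Z"]})  # "Z" deliberately breaks the nesting

checks = {
    "cancer within all": titles_are_subset(cancer_toy, all_toy),
    "patients within cancer": titles_are_subset(patient_toy, cancer_toy),
    "selected within patients": titles_are_subset(selected_toy, patient_toy),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

With the real data, the same helper could be applied to `all_papers`, `cancer_papers`, `patient_cancer_papers`, and `rand_selected_papers` in place of the toy frames.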
