### Objective of the Notebook:
This notebook preprocesses the MICCAI 2023 papers and extracts structured information from them. The PDFs are converted to XML with GROBID, the XML is parsed into section headers and their related text, and the result is stored in clean dataframes suitable for further analysis. From these dataframes the notebook selects cancer-related papers and, within those, papers that mention patients.

### Input Data Expected:
The input for this notebook includes:
- MICCAI 2023 PDF files (730 in total), organised by volume into 10 folders, which GROBID converts to XML.
- The resulting XML documents, located at a predefined directory on the user's system. They must be well-formed and readable by the script.

### Output Data/Files Generated:
Outputs generated by this notebook include:
- Structured CSV files or dataframes containing parsed and cleaned information from the XML documents.
- A CSV file specifically aggregating cancer-related papers extracted from the XML data.
- Various CSV files based on further refined categorizations (like patient mentions in the cancer-related papers).

### Assumptions or Important Notes:
- The GROBID server must be running on the local machine (started from a terminal) before the notebook is run. The GROBID client relies on it to convert the PDF files into XML documents.
- XML documents are expected to follow a consistent structure that allows for automated parsing by the script.
- XML files are renamed after the title extracted from their content, so the script assumes that each document can be uniquely identified by its title. Some files had to be renamed manually, for example when the title contained a '/' or was too long to be used as a file name.
- The script assumes the presence of specific headers and textual formats within the XML files for accurate parsing.
- Error handling is minimal, so XML files should not contain corrupt data or unexpected formatting errors.
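
Since everything downstream depends on a reachable GROBID server, a small pre-flight check can save a failed run. The following is a minimal sketch, not part of the original notebook: the helper names `isalive_url` and `is_grobid_alive` are hypothetical, and it assumes the server exposes GROBID's `/api/isalive` health-check endpoint at the address used later in the notebook.

```python
import http.client
from urllib.request import urlopen


def isalive_url(grobid_server):
    """Build the URL of the health-check endpoint (hypothetical helper)."""
    return grobid_server.rstrip("/") + "/api/isalive"


def is_grobid_alive(grobid_server="http://localhost:8070", timeout=2.0):
    """Return True only if the server answers the health check with 'true'."""
    try:
        with urlopen(isalive_url(grobid_server), timeout=timeout) as response:
            body = response.read().decode("utf-8")
            return body.strip().lower() == "true"
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or malformed URL
        return False
```

Calling `is_grobid_alive()` before the processing cell returns `False` instead of raising when the server has not been started.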

# Old code
***

***
#### Merge all volumes into 1 dataframe
***

In [None]:
# for vol, df in cleaned_dataframes.items():
#     volume_number = int(re.search(r'\d+', vol).group())
#     # Append volume information to each title to ensure uniqueness across volumes
#     df['Paper Title'] = df['Paper Title'].astype(str) + ' (vol' + str(volume_number) + ')'
#     df['Volume'] = volume_number

# combined_df = pd.concat(cleaned_dataframes.values(), ignore_index=True)
# # Check unique titles after appending volume information
# unique_titles_count = len(combined_df['Paper Title'].unique())
# print(f"Total unique titles after enhancement: {unique_titles_count}")

In [None]:
combined_df = combined_df.fillna(0)
#combined_df.to_csv('combined_df.csv')

# Lowercase all text in the 'Text' column
combined_df['Text'] = combined_df['Text'].str.lower()
combined_df['Text'] = combined_df['Text'].apply(wrap_text, width = 80)

# Regular expression with str.replace to remove the volume information
combined_df['Paper Title'] = combined_df['Paper Title'].str.replace(r'\s*\(vol\d+\)', '', regex=True)

combined_df.rename(columns={'Paper Title': 'title', 'Header Number':'header_no', 
                                           'Header Title': 'header_title', 'Text':'text', 'Volume': 'volume'}, inplace=True)

#combined_df.to_csv('refined_all_papers_extracted_w_text.csv')

***
***

In [None]:
#filename = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/outputs/databases/refined_all_papers_extracted_w_text.csv'
filename = "/Users/yasminsarkhosh/Desktop/all_papers_w_extracted_text.csv"
combined_df = pd.read_csv(filename, index_col=0)
print(len(combined_df['title'].unique()))
combined_df

### Select papers related to cancer
***

In [None]:
# Search for 'cancer' in the Text column, case insensitive
cancer_papers_mask = combined_df['text'].str.contains('cancer|tumor|tumour', case=False, na=False)
papers_with_cancer = combined_df[cancer_papers_mask]

# Get the unique titles of papers that mention 'cancer'
unique_titles_with_cancer = papers_with_cancer['title'].unique()

# Extract all headers and their related text for papers that mention 'cancer'
extracted_info = pd.DataFrame()
for title in unique_titles_with_cancer:
    paper_info = combined_df[combined_df['title'] == title]
    extracted_info = pd.concat([extracted_info, paper_info])

# Reset index of the resulting DataFrame
extracted_info.reset_index(drop=True, inplace=True)


unique_paper_titles_with_cancer = extracted_info['title'].unique()
print(len(unique_paper_titles_with_cancer))

# Display the first few rows of the resulting DataFrame
extracted_info
extracted_info.to_csv("cancer_related_papers_w_text.csv", index=False)


# Automated Extraction and Analysis of MICCAI 2023 XML Documents
***

In [1]:
# Libraries 
import os
import re
import pandas as pd

# The GROBID client to process and extract information from XML files
from grobid_client.grobid_client import GrobidClient

# For parsing XML files
from xml.etree import ElementTree as et

1. Text wrapping utility function that wraps text lines within a dataframe row
***

In [2]:
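def wrap_text(text, width=80):
    """Wrap text to `width` characters per line, keeping existing line breaks."""
    # Leave missing values (NaN/None) untouched
    if pd.isnull(text):
        return text

    wrapped_lines = []
    for paragraph in text.split('\n'):
        line = ''
        for word in paragraph.split():
            # Start a new line when the next word would exceed the width;
            # checking `line` avoids emitting an empty line before an over-long first word
            if line and len(line) + len(word) + 1 > width:
                wrapped_lines.append(line)
                line = word
            else:
                line += ' ' + word if line else word
        wrapped_lines.append(line)
    return '\n'.join(wrapped_lines)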
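
For comparison, roughly the same wrapping can be obtained from the standard library. This is a sketch under assumptions, not the notebook's method: `wrap_text_stdlib` is a hypothetical name, and it treats any non-string value as missing instead of calling `pd.isnull`.

```python
import textwrap


def wrap_text_stdlib(text, width=80):
    """Wrap each paragraph of `text` to at most `width` characters per line."""
    if not isinstance(text, str):
        # Leave missing values (None, NaN) untouched, as wrap_text does
        return text
    wrapped = []
    for paragraph in text.split("\n"):
        # Collapse runs of whitespace first, mirroring paragraph.split() in wrap_text
        normalised = " ".join(paragraph.split())
        wrapped.append(
            textwrap.fill(
                normalised,
                width=width,
                break_long_words=False,  # keep over-long words whole
                break_on_hyphens=False,  # do not split hyphenated words
            )
        )
    return "\n".join(wrapped)
```

It can be applied in the same way, for example `df['Text'].apply(wrap_text_stdlib, width=80)`.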

2. Interacting with the GROBID client for processing documents 
***

In [3]:
def process_fulltext_document(client, process_file, output_dir):
    client.process('processFulltextDocument', process_file, output=output_dir, force=True)

3. Renaming MICCAI 2023 XML files based on their individual titles
***

In [4]:
# Rename XML files in a folder to the title of the paper
def rename_xml_files_in_folder(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith('.xml'):
            file_path = os.path.join(folder_path, filename)
            try:
                tree = et.parse(file_path)
                root = tree.getroot()
                paper_title = find_title(root)
                if paper_title:
                    new_filename = paper_title.replace(" ", "_") + '.xml'
                    new_file_path = os.path.join(folder_path, new_filename)
                    os.rename(file_path, new_file_path)
            except et.ParseError as e:
                print(f"Error parsing '{filename}': {e}")

def find_title(element):
    if 'title' in element.tag.lower() and element.text:
        return element.text.strip()
    for child in element:
        title = find_title(child)
        if title:
            return title
    return None
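
Titles that contain a '/' or are very long cannot be used directly as file names, which is why some files had to be renamed by hand. One possible way to avoid this is to sanitise the title first. The sketch below is an illustration only: `safe_xml_filename` and the `max_length` limit of 150 characters are hypothetical and not part of the notebook.

```python
import re


def safe_xml_filename(title, max_length=150):
    """Turn a paper title into a file name that is safe on common file systems."""
    # Replace path separators and other characters that are invalid in file names
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title.strip())
    # Collapse whitespace runs into single underscores (the notebook replaces spaces)
    cleaned = re.sub(r"\s+", "_", cleaned)
    # Truncate over-long titles; max_length is an arbitrary, assumed limit
    return cleaned[:max_length] + ".xml"
```

Truncation can make two different titles collide, so a check with `os.path.exists` before `os.rename` would still be advisable.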

4. Parsing XML files to extract headers and related text into a dataframe
***

In [5]:

# lxml is required here: xpath() and getparent() are not available in xml.etree
from lxml import etree

def parse_xml_and_extract_headers(file_path):
    tree = etree.parse(file_path)
    root = tree.getroot()
    ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

    # Extract the paper title: the first <title> element in the TEI namespace
    paper_title_element = root.find('.//tei:title', ns)
    paper_title = paper_title_element.text if paper_title_element is not None else "No Title Found"

    headers = root.xpath('//tei:head', namespaces=ns)
    print(f"Found {len(headers)} headers in '{paper_title}'")
    
    data = []
    for header in headers:
        # Join every text node inside the <p> elements under the header's parent, including nested elements
        text_content = ''.join(header.getparent().xpath('.//tei:p//text()', namespaces=ns))
        data.append({
            'Paper Title': paper_title,
            'Header Number': header.get('n'),
            'Header Title': header.text,
            'Text': text_content
        })

    df = pd.DataFrame(data, columns=['Paper Title', 'Header Number', 'Header Title', 'Text'])
    return df
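
To make the header/text pairing concrete, here is a small self-contained sketch on a toy document. It is an assumption-laden illustration: `SAMPLE_TEI` is invented, `extract_sections` is a hypothetical name, and it uses the standard library's `ElementTree` (iterating over `<div>` parents because there is no `getparent()`) on the assumption that each `<head>` sits in a `<div>` next to its `<p>` elements.

```python
from xml.etree import ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented toy document that imitates the assumed TEI layout
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Toy Paper</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head n="1">Introduction</head><p>First <ref>[1]</ref> paragraph.</p><p>Second.</p></div>
    <div><head n="2">Method</head><p>Details.</p></div>
  </body></text>
</TEI>"""


def extract_sections(xml_string):
    """Return one dict per section header with the text of its paragraphs."""
    root = ET.fromstring(xml_string)
    title_el = root.find(".//tei:title", TEI_NS)
    title = title_el.text if title_el is not None else "No Title Found"
    rows = []
    for div in root.iter("{http://www.tei-c.org/ns/1.0}div"):
        head = div.find("tei:head", TEI_NS)
        if head is None:
            continue  # a <div> without a header has nothing to pair with
        # itertext() also collects text inside nested elements such as <ref>
        text = "".join("".join(p.itertext()) for p in div.findall(".//tei:p", TEI_NS))
        rows.append({
            "Paper Title": title,
            "Header Number": head.get("n"),
            "Header Title": head.text,
            "Text": text,
        })
    return rows


sections = extract_sections(SAMPLE_TEI)
```

Each entry of `sections` has the same four keys as the dataframe columns built by `parse_xml_and_extract_headers`.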

5. Aggregating and cleaning data from multiple sources.
***

In [6]:
def process_xml_folder(folder_path):
    # Aggregates data from multiple XML files in a given folder
    all_data_frames = []
    for file_name in os.listdir(folder_path):
        if file_name.endswith(".xml"):
            file_path = os.path.join(folder_path, file_name)
            df = parse_xml_and_extract_headers(file_path)
            all_data_frames.append(df)

    if all_data_frames:
        final_df = pd.concat(all_data_frames, ignore_index=True)
    else:
        final_df = pd.DataFrame()

    return final_df

# Drop rows without text (rows missing both header title and text are dropped as well)
def clean_dataframe(df):
    df_cleaned = df.dropna(subset=['Header Title', 'Text'], how='all').dropna(subset=['Text'], how='any')
    return df_cleaned

# Merge each dataframe (where 1 dataframe contains all information from volume 1 etc.) into a final dataframe with all papers from MICCAI 2023
def merge_dataframes(cleaned_dataframes):
    for vol, df in cleaned_dataframes.items():
        volume_number = int(re.search(r'\d+', vol).group())
        df['Paper Title'] += f' (vol{volume_number})'
        df['Volume'] = volume_number
    return pd.concat(cleaned_dataframes.values(), ignore_index=True)
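
`merge_dataframes` appends a ` (volN)` suffix so that identical titles in different volumes stay distinct, and the "Old code" cells later strip it again with the regular expression `\s*\(vol\d+\)`. The round trip can be sketched on plain strings as follows. The helper names are hypothetical, and the explicit `ValueError` is an addition; the notebook's version would fail with an `AttributeError` on a key without digits.

```python
import re


def volume_number(volume_key):
    """Extract the integer volume number from a key such as 'vol10'."""
    match = re.search(r"\d+", volume_key)
    if match is None:
        raise ValueError(f"No volume number found in {volume_key!r}")
    return int(match.group())


def tag_title(title, volume_key):
    """Append the volume suffix, as merge_dataframes does for a whole column."""
    return f"{title} (vol{volume_number(volume_key)})"


def untag_title(tagged_title):
    """Remove the volume suffix again."""
    return re.sub(r"\s*\(vol\d+\)", "", tagged_title)
```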

In [7]:
# Run GROBID in terminal before running the notebook
# Installation and running commands
# wget https://github.com/kermitt2/grobid/archive/0.8.0.zip
# unzip 0.8.0.zip
# cd grobid-0.8.0
# ./gradlew run

# GROBID library: Runs in terminal
grobid_server = 'http://localhost:8070'
client = GrobidClient(grobid_server=grobid_server)

# MICCAI 2023 PDF files, organised by volumes into separate folders
# 730 PDFs in total, divided into 10 folders
process_file = '/Users/yasminsarkhosh/Documents/archive/miccai_papers/vol1'
#process_file = '../miccai_papers'
# Output folder for processed PDF files as XML files
#output_dir = '../miccai_XML_documents'
output_dir = "/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/03MICCAI_notebook_outputs"

# Call the function to start the preprocessing and data extraction 
process_fulltext_document(client, process_file, output_dir)


GROBID server is up and running
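
The cell above processes a single volume folder. One way to cover all ten folders without repeating the call by hand is to generate the input/output pairs first. This is a sketch under assumptions: `volume_jobs` is a hypothetical helper, and the `vol1` ... `vol10` input names and zero-padded `vol01` ... `vol10` output names follow the folder names that appear elsewhere in this notebook.

```python
import os


def volume_jobs(papers_root, xml_root, n_volumes=10):
    """Return (pdf_folder, xml_folder) pairs for volumes 1..n_volumes."""
    jobs = []
    for i in range(1, n_volumes + 1):
        pdf_dir = os.path.join(papers_root, f"vol{i}")     # e.g. ../miccai_papers/vol1
        xml_dir = os.path.join(xml_root, f"vol{i:02d}")    # e.g. ../miccai_XML_documents/vol01
        jobs.append((pdf_dir, xml_dir))
    return jobs


# Usage (needs a running GROBID server plus `client` and
# `process_fulltext_document` from the cells above):
# for pdf_dir, xml_dir in volume_jobs("../miccai_papers", "../miccai_XML_documents"):
#     process_fulltext_document(client, pdf_dir, xml_dir)
```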


In [None]:
#combined_df = process_fulltext_document(client, process_file, output_dir)
#combined_df.to_csv('03MICCAI_notebook_all_papers_w_extracted_text.csv')

### **Scope of papers #1: cancer-related medical AI papers**
***

6. Selection of papers: identifying cancer-related keywords in papers within the aggregated data.
***

In [21]:
notebook_name = "03MICCAI_notebook_"
filename = "03MICCAI_notebook_all_papers_w_extracted_text.csv"
combined_df = pd.read_csv(filename, index_col=0)
print(len(combined_df['title'].unique()))
combined_df

730


| Unnamed: 0 | title | header_no | header_title | text | volume |
|---|---|---|---|---|---|
| 0 | AMAE: Adaptation of Pre-trained Masked Autoenc... | 1.0 | Introduction | to reduce radiologists' reading burden and mak... | 1 |
| 1 | AMAE: Adaptation of Pre-trained Masked Autoenc... | 2.0 | Method | notation. we first formally define the problem... | 1 |
| 2 | AMAE: Adaptation of Pre-trained Masked Autoenc... | 2.1 | Stage 1-Proxy Task to Detect Synthetic Anomalies | amae starts the first training stage using onl... | 1 |
| 3 | AMAE: Adaptation of Pre-trained Masked Autoenc... | 2.2 | Stage 2-MAE Inter-Discrepancy Adaptation | the proposed mae adaptation scheme is inspired... | 1 |
| 4 | AMAE: Adaptation of Pre-trained Masked Autoenc... | 3.0 | Experiments | datasets. we evaluated our method on three pub... | 1 |
| ... | ... | ... | ... | ... | ... |
| 6963 | LightNeuS: Neural Surface Reconstruction in En... | 3.1 | Using Illumination Decline as a Depth Cue | the neus formulation of sect. 2 assumes distan... | 10 |
| 6964 | LightNeuS: Neural Surface Reconstruction in En... | 3.2 | Endoscope Photometric Model | apart from illumination decline, there are sev... | 10 |
| 6965 | LightNeuS: Neural Surface Reconstruction in En... | 4.0 | Experiments | we validate our method on the c3vd dataset [4]... | 10 |
| 6966 | LightNeuS: Neural Surface Reconstruction in En... | 5.0 | Conclusion | we have presented a method for 3d dense multi-... | 10 |


In [22]:
# Search for 'cancer, tumor, tumour' in the text column, case insensitive
cancer_papers_mask = combined_df['text'].str.contains('cancer|tumor|tumour', case=False, na=False)
papers_with_cancer = combined_df[cancer_papers_mask]

# Get the unique titles of papers that mention 'cancer'
unique_titles_with_cancer = papers_with_cancer['title'].unique()

# Extract all headers and their related text for papers that mention 'cancer'
extracted_info = pd.DataFrame()
for title in unique_titles_with_cancer:
    paper_info = combined_df[combined_df['title'] == title]
    extracted_info = pd.concat([extracted_info, paper_info])

# Reset index of the resulting DataFrame
extracted_info.reset_index(drop=True, inplace=True)

unique_paper_titles_with_cancer = extracted_info['title'].unique()
print(len(unique_paper_titles_with_cancer))

# Save the extracted information to a CSV file
# extracted_info.to_csv(notebook_name + 'cancer_related_papers_w_text.csv')

263
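
The selection above concatenates one slice per title inside a loop. The same rows can usually be selected in a single step with `isin`. The sketch below shows this on an invented toy dataframe; `select_papers_mentioning` is a hypothetical helper, and rows come back in their original order rather than grouped title by title, which should be equivalent when each paper's rows are already contiguous.

```python
import pandas as pd

# Invented toy data: one row per section, as in combined_df
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B", "C"],
    "text": ["intro", "tumor segmentation", "intro", "method", "lung cancer"],
})


def select_papers_mentioning(df, pattern):
    """Keep every row of each paper that matches `pattern` in at least one row."""
    mask = df["text"].str.contains(pattern, case=False, na=False)
    titles = df.loc[mask, "title"].unique()
    return df[df["title"].isin(titles)].reset_index(drop=True)


cancer_toy = select_papers_mentioning(toy, "cancer|tumor|tumour")
```

Paper "B" never mentions a keyword and is dropped, while both rows of paper "A" are kept even though only one of them matches.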


### **Scope of papers #2: cancer-related medical AI papers that mention 'patients'**
***

As an experiment, I narrowed the 263 cancer-related medical AI papers down to those that mention 'patient' or 'patients', that is, papers working with datasets in which patients form a subgroup. Note that this code selects papers by keyword match only.

This brings the total number of papers down to 155.

In [24]:
# Keywords per category; duplicates removed ('age' was listed twice, and 'gender' belongs to the gender category)
categories = {
    'age': ['age', 'young', 'old'],
    'gender': ['gender', 'sex', 'women', 'woman', 'female', 'male'],
    'ethnicity': ['ethnicity', 'ethnicities', 'race', 'white patients', 'black patients'],
    'location_info': ['geolocation', 'geographical', 'geographic', 'country', 'countries', 
                    'city', 'cities', 'hospital', 'hospitals', 'clinic', 'clinics', 'continent',
                    'province', 'state', 'region', 'town', 'village', 'area', 'district'],
    'patients': ['patient', 'patients'],
    'dataset_info': ['dataset', 'datasets', 'data set', 'data sets', 'publicly', 'public', 'private', 'open access', 'open-access'],
    'bias_info': ['bias', 'biases', 'fairness'],
}

# Flatten the list of all keywords excluding 'patients' to avoid redundancy
all_keywords = sum([kw for cat, kw in categories.items() if cat != 'patients'], [])

# Filter papers that mention 'patient' or 'patients'
scope_mask = extracted_info['text'].str.contains('patient|patients', case=False, na=False)
papers_with_patients = extracted_info[scope_mask]

# Prepare a list to collect paper info dictionaries
papers_info_list = []

# Iterate over unique titles in the filtered DataFrame
for title in papers_with_patients['title'].unique():
    paper_info = papers_with_patients[papers_with_patients['title'] == title]
    # Initialize a dictionary for the current paper with zeros for all keywords
    paper_keywords = dict.fromkeys(all_keywords, 0)
    paper_keywords['title'] = title
    # Check for each keyword in the text of the paper
    for keyword in all_keywords:
        if any(paper_info['text'].str.contains(keyword, case=False, na=False)):
            paper_keywords[keyword] = 1
    # Collect the keyword matches for the current paper
    papers_info_list.append(paper_keywords)

# Create a DataFrame from the list of dictionaries
keywords_per_paper = pd.DataFrame(papers_info_list)

# Display or work with the keywords_per_paper DataFrame
#keywords_per_paper.to_csv('keywords_per_paper.csv')
#papers_with_patients.to_csv(notebook_name + 'papers_with_patients.csv')

print('Number of unique titles for papers containing the keyword <patient/patients>:', len(papers_with_patients['title'].unique()))

Number of unique titles for papers containing the keyword <patient/patients>: 155
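
Because the keywords are matched as plain substrings, short ones can match inside longer words: 'age' also matches 'image', and 'male' also matches 'female'. A possible refinement, shown here only as a sketch with hypothetical helper names, is to anchor each keyword at word boundaries.

```python
import re


def keyword_pattern(keyword):
    """Compile a case-insensitive pattern that matches `keyword` as a whole word."""
    return re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)


def mentions(text, keyword):
    """Return True if `keyword` occurs in `text` as a whole word or phrase."""
    return keyword_pattern(keyword).search(text) is not None
```

The pattern string (`keyword_pattern(kw).pattern`) can also be passed to `Series.str.contains` with `case=False`, since that method treats its argument as a regular expression by default. Counts produced this way may differ from the substring-based ones.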



In [None]:
# Old code
#process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/vol1'
#client.process('processFulltextDocument', process_file, output="./vol01", force=True)

#process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol2'
#client.process('processFulltextDocument', process_file, output = "./vol02", force=True)

#process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol3'
#client.process('processFulltextDocument', process_file, output="./vol03", force=True)

#process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol4'
#client.process('processFulltextDocument', process_file, output="./vol04", force=True)

# process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol5'
# client.process('processFulltextDocument', process_file, output="./vol05", force=True)

#process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol6'
#client.process('processFulltextDocument', process_file, output="./vol06", force=True)

#process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol7'
#client.process('processFulltextDocument', process_file, output="./vol07", force=True)

#process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol8'
#client.process('processFulltextDocument', process_file, output="./vol08", force=True)

#process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol9'
#client.process('processFulltextDocument', process_file, output="./vol09", force=True)

#process_file = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol10'
#client.process('processFulltextDocument', process_file, output="./vol10", force=True)  

# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol01'
# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol02'
# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol03'
# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol04'
# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol05'
# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol06'
# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol07'
# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol08'
# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol09'
# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol10'


#process_file = base_path + 'vol2'
#client.process('processFulltextDocument', process_file, output = "./vol02", force=True)

#process_file = base_path + 'vol3'
#client.process('processFulltextDocument', process_file, output="./vol03", force=True)

#process_file = base_path + 'vol4'
#client.process('processFulltextDocument', process_file, output="./vol04", force=True)

# process_file = base_path + 'vol5'
# client.process('processFulltextDocument', process_file, output="./vol05", force=True)

#process_file = base_path + 'vol6'
#client.process('processFulltextDocument', process_file, output="./vol06", force=True)

#process_file = base_path + 'vol7'
#client.process('processFulltextDocument', process_file, output="./vol07", force=True)

#process_file = base_path + 'vol8'
#client.process('processFulltextDocument', process_file, output="./vol08", force=True)

#process_file = base_path + 'vol9'
#client.process('processFulltextDocument', process_file, output="./vol09", force=True)

#process_file = base_path + 'vol10'
#client.process('processFulltextDocument', process_file, output="./vol10", force=True)  

# folder_path = '../vol01'
# folder_path = '../vol02'
# folder_path = '../vol03'
# folder_path = '../vol04'
# folder_path = '../vol05'
# folder_path = '../vol06'
# folder_path = '../vol07'
# folder_path = '../vol08'
# folder_path = '../vol09'
# folder_path = '../vol10'
# # The collection of MICCAI 2023 papers is stored in the following directory structure:
# # ../miccai_papers/vol1
# # ../miccai_papers/vol2
# # ...
# # ../miccai_papers/vol10

# ''' 
# For each subfolder in the main folder_path directory of the MICCAI 2023 papers, 
# the GROBID client is used to process the full text of the papers by converting the PDF files to XML files.
# '''

# base_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/project_submission/00_project_notebook/miccai_papers'

# for folder in os.listdir(base_path):
#     process_file = os.path.join(base_path, folder)  # base_path has no trailing slash
#     output = f"./processed_vol{folder}"
#     print(f"Processing {folder}...")
#     client.process('processFulltextDocument', process_file, output=output, force=True)
#     print(f"Output: {output}")
#     print("Force: True")

# folder_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/vol01/vol1'
# rename_xml_files_in_folder(folder_path)


# Manually renaming the XML files based on their title tags

# Load and parse the XML file
#file_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol09/paper_59.grobid.tei.xml'
#file_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/output2/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'

''' Paper 44 XML file: title was too long to be used as a file name, removed '/CT Self-supervised Denoising' from the title '''
#file_path ='/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/paper_44.grobid.tei.xml'

#file_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/output2/paper_13.grobid.tei.xml'

'''FileNotFoundError: [Errno 2] No such file or directory: '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/paper_49.grobid.tei.xml' -> 
'/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/A_Patient-Specific_Self-supervised_Model_for_Automatic_X-Ray/CT_Registration.xml'
Solution: removed '/CT_Registration' from title
'''
#file_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/paper_49.grobid.tei.xml'
#file_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/papers_xml/vol04/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
#file_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/papers_xml/vol01/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
file_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/papers_xml/vol02/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'

tree = et.parse(file_path)
root = tree.getroot()

# Since XML namespaces can complicate direct tag access, we find the title tag dynamically.
# This approach is based on the assumption that titles are relatively unique in structure.

# Attempt to extract the paper title. This might need adjustments based on the actual structure.
title = None
for elem in root.iter():
    if 'title' in elem.tag.lower():
        title = elem.text
        break

title_clean = title.strip().replace(" ", "_") if title else "Untitled_Document"
title_clean

# Attempt a more generic search for the title, considering common patterns in scholarly articles
# We'll look for title elements that might be nested within other elements (like "titleStmt" or "fileDesc" in TEI format)

def find_title(element):
    """
    Recursively search for the title element in the XML structure.
    """
    if 'title' in element.tag.lower() and element.text:
        return element.text.strip()
    for child in element:
        title = find_title(child)
        if title:
            return title
    return None

# Attempt to find the title using the recursive search
paper_title = find_title(root)
paper_title_clean = paper_title.replace(" ", "_") if paper_title else "Untitled_Document"
paper_title, paper_title_clean

import os

# Define the new file path with the clean title
new_file_path = os.path.join(os.path.dirname(file_path), f"{paper_title_clean}.xml")

# Rename the file
os.rename(file_path, new_file_path)

new_file_path

# Manually renaming the XML files that are named "Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml" 

path = "../03MICCAI_notebook_GROBID_processed_volumes"

# Load and parse the XML file
# file_path = path + '/vol1/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol2/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = '/vol4/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = '/vol6/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = '/vol7/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = '/vol9/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'

# path = "03MICCAI_notebook_GROBID_processed_volumes"

# Load and parse the XML file
# file_path = path + '/vol1/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol2/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol4/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol6/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol7/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'
# file_path = path + '/vol9/Medical_Image_Computing_and_Computer_Assisted_Intervention_–_MICCAI_2023.xml'

# df_headers = process_xml_folder(folder_path)
# df_headers.to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/vol1_headers.csv", index=False)
# #df_headers

# cleaned_dataframes['vol1'].to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/dfs/vol1_cleaned.csv", index=False)
# cleaned_dataframes['vol2'].to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/dfs/vol2_cleaned.csv", index=False)
# cleaned_dataframes['vol3'].to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/dfs/vol3_cleaned.csv", index=False)
# cleaned_dataframes['vol4'].to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/dfs/vol4_cleaned.csv", index=False)
# cleaned_dataframes['vol5'].to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/dfs/vol5_cleaned.csv", index=False)
# cleaned_dataframes['vol6'].to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/dfs/vol6_cleaned.csv", index=False)
# cleaned_dataframes['vol7'].to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/dfs/vol7_cleaned.csv", index=False)
# cleaned_dataframes['vol8'].to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/dfs/vol8_cleaned.csv", index=False)
# cleaned_dataframes['vol9'].to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/dfs/vol9_cleaned.csv", index=False)
# cleaned_dataframes['vol10'].to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/dfs/vol10_cleaned.csv", index=False)

# # Loop over the volume directories
# for i in range(1, 11):
#     vol_path = os.path.join(base_path, f'/03MICCAI_notebook_df_vol{str(i)}_headers.csv')
#     df_headers = process_xml_folder(vol_path)
#     df_cleaned = clean_dataframe(df_headers)
    
#     # Store the cleaned DataFrame in the dictionary with the volume number as the key
#     cleaned_dataframes[f'vol{str(i)}'] = df_cleaned

# Dictionary with all cleaned DataFrames
# cleaned_dataframes['vol1'], cleaned_dataframes['vol2']

# print('total papers in 1:', len(cleaned_dataframes['vol1']['Paper Title'].unique())) # 73
# print('total papers in 2:', len(cleaned_dataframes['vol2']['Paper Title'].unique())) # 73  
# print('total papers in 3:', len(cleaned_dataframes['vol3']['Paper Title'].unique())) # 72 
# print('total papers in 4:', len(cleaned_dataframes['vol4']['Paper Title'].unique())) # 75 
# print('total papers in 5:', len(cleaned_dataframes['vol5']['Paper Title'].unique())) # 76 
# print('total papers in 6:', len(cleaned_dataframes['vol6']['Paper Title'].unique())) # 77 
# print('total papers in 7:', len(cleaned_dataframes['vol7']['Paper Title'].unique())) # 75 
# print('total papers in 8:', len(cleaned_dataframes['vol8']['Paper Title'].unique())) # 65 
# print('total papers in 9:', len(cleaned_dataframes['vol9']['Paper Title'].unique())) # 70
# print('total papers in 10:', len(cleaned_dataframes['vol10']['Paper Title'].unique())) # 74

