#### Objective of the Notebook:
The notebook automates the extraction and processing of data from HTML documents that contain information about MICCAI 2023 conference papers. It gathers titles, authors, DOIs, and page numbers from manually saved HTML content, organizes this information into a structured dataframe, and saves it for further analysis or reference.

MICCAI 2023 webpage listing the 730 accepted papers: https://conferences.miccai.org/2023/papers/

The 730 research articles are divided into 10 PDF volumes (the link points to Volume 1): https://link.springer.com/book/10.1007/978-3-031-43907-0

#### Input Data Expected:
1. **HTML Documents**: The input for this notebook consists of HTML files saved as `.doc` files. Each document corresponds to a page from the table of contents of the MICCAI 2023 conference volumes, detailing titles, authors, DOIs, and page numbers of the conference papers.
    - In total there are 41 `.doc` files containing HTML code: [MICCAI 2023 .doc files with HTML code](<../../../miccai2023 papers>)
2. **Directory Paths**: The paths where these HTML documents are stored need to be provided. The notebook expects a structured directory containing these documents for each volume of the conference proceedings.

#### Output Data/Files Generated:
1. **Dataframes**: The primary output is a Pandas dataframe containing cleaned and structured data extracted from the HTML documents. This dataframe includes columns for paper titles, authors, page numbers, DOIs, publication years, and volume numbers.
2. **CSV File**: The cleaned and consolidated dataframe is saved as a CSV file named `database_miccai_<year>.csv`, where `<year>` represents the year of the conference. This file contains all the information ready for further analysis or archival purposes.

#### Assumptions or Important Notes:
1. **Pre-processed HTML Documents**: It is assumed that the HTML content from the web pages has been manually copied and saved into `.doc` files accurately without any alterations that could affect the structure of the information.
2. **Manual Handling of HTML**: The method of acquiring HTML content involves manual copying and pasting, which may introduce human error or inconsistencies in how data is formatted or saved.
3. **File Naming Convention**: The notebook assumes a specific naming convention for the HTML files, which includes details about the volume and page number. Any deviation from this naming convention might require modifications to the file-processing logic in the notebook.
4. **HTML Structure Dependence**: The extraction functions rely heavily on the structure of the HTML documents being consistent across all files. Changes in the webpage layout or HTML structure in future conferences will require adjustments to the scraping logic.
5. **Exclusion Criteria**: Specific page numbers known to contain irrelevant information (like front and back matter) are programmatically excluded during processing to ensure the cleanliness of the dataset.
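
The naming convention from note 3 can be made explicit with a regular expression. The following is a minimal sketch, assuming file names of the form `miccai2023 vol 01 page 1 of 4.doc`; the names `FILENAME_PATTERN` and `parse_filename` are hypothetical and not part of the notebook.

```python
import re
from pathlib import PurePosixPath

# Assumed convention: "... vol <volume> page <page> of <total>.doc"
FILENAME_PATTERN = re.compile(
    r"vol (?P<volume>\d+) page (?P<page>\d+) of (?P<total>\d+)\.doc$"
)

def parse_filename(path):
    """Return volume, page, and total page count parsed from a file name."""
    match = FILENAME_PATTERN.search(PurePosixPath(path).name)
    if match is None:
        # Fail loudly when a file deviates from the expected convention
        raise ValueError(f"Unexpected file name: {path!r}")
    return {key: int(value) for key, value in match.groupdict().items()}

info = parse_filename("miccai2023 papers/miccai2023 vol 07 page 5 of 5.doc")
print(info)
```

Parsing by pattern rather than by character position keeps working if a volume ever has ten or more pages, and it reports a clear error for misnamed files.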

# Automated Data Extraction and Processing of MICCAI 2023 Conference Papers
***

Libraries and functions

In [22]:
# HTML parsing
from bs4 import BeautifulSoup

# HTTP requests (for fetching pages from the web)
import requests

# input/output operations
import io

#!pip install xhtml2pdf requests
#!pip install lxml

# regular expressions (third-party `regex` package, imported under the name `re`)
import regex as re
import pandas as pd
from urllib.request import urlopen
from urllib import request as urllib2
import numpy as np

Main function for text mining relevant information from MICCAI 2023 HTML files into a dataframe
***


In [57]:
pd.set_option('display.max_rows', None)

In [58]:
#helper function to find lines in html document with DOI and titles of articles
def has_doi(href):
    return href and re.compile("chapter/").search(href)
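
How a DOI can be recovered from a chapter link is sketched below using only the standard library `re` module. The sample hrefs and the names `CHAPTER_LINK` and `doi_from_href` are assumptions for illustration; the real link format in the saved pages may differ.

```python
import re

# A chapter link is assumed to contain "/chapter/" followed by the DOI
CHAPTER_LINK = re.compile(r"/chapter/(?P<doi>10\.\d{4,9}/\S+)")

def doi_from_href(href):
    """Return the DOI embedded in a chapter link, or None for any other link."""
    if not href:
        return None
    match = CHAPTER_LINK.search(href)
    return match.group("doi") if match else None

# Hypothetical hrefs: absolute link, relative link, non-chapter link, missing href
sample_hrefs = [
    "https://link.springer.com/chapter/10.1007/978-3-031-43907-0_14",
    "/chapter/10.1007/978-3-031-43990-2_43",
    "/book/10.1007/978-3-031-43907-0",
    None,
]
dois = [doi_from_href(href) for href in sample_hrefs]
print(dois)
```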

In [59]:
#main mining function returning the initial dataframe 
def mining(html_doc, year, current_page, all_pages, part):
    #opening the html document (copy pasted and saved as a .doc file); the file is closed automatically
    with open(html_doc, "r", encoding="ISO-8859-1") as doc:
        soup = BeautifulSoup(doc, 'html.parser')

    list_of_doi = soup.find_all(href=has_doi)
        
    #getting the titles and the DOIs from the links found by the helper function
    titles = []
    doi_str = []

    for element in list_of_doi:
        titles.append(element.get_text()) #the link text is the title
        #the DOI is the part of the href that follows 'chapter/'
        doi_str.append(element['href'].split('chapter/', 1)[1])
                
    ## now the lines containing author are found
    authors = soup.find_all("li", class_="c-author-list__item")
        
    #keeping only the author names
    authors_str = []
    for element in authors:
        string = str(element)
        first_substring = 'item">'
        second_substring ='</li>'
        authors_str.append(string[(string.find(first_substring)+6):string.find(second_substring)])

    #now the lines containing page numbers are found
    page_numbers= soup.find_all('div', class_ = "c-meta")

    #keeping only the page numbers
    page_numbers_str = []

    # an element in page_numbers looks like this:
    # <div class="c-meta"><span class="c-meta__item u-display-inline-block"
    # data-test="page-number"> Pages 618-627</span> </div>

    #page ranges known to be front or back matter rather than papers (compared exactly)
    excluded_ranges = {'1-1', '463-463', '135-135', '649-649', '247-247', '281-281',
                       '369-369', '433-433', '509-509', '583-583', '757-757', '535-535'}
    #back-matter page ranges that appear on the last page of a volume
    back_matter_ranges = ('781-785', '787-791', '767-771', '797-801', '801-806',
                          '813-818', '687-690', '739-743', '791-795')

    for element in page_numbers:
        #removes white spaces, and everything within the div tag that is not the page numbers
        string = element.get_text()[6:-1]                       #618-627
        #splits the string into two numbers
        both = string.split("-")                                #['618', '627']
        #filtering out front matters and back matters
        if 'x' not in string and string.strip() not in excluded_ranges:
            try:
                if int(both[1]) - int(both[0]) > 1:
                    page_numbers_str.append(string)              #if True adds the string to the list
            except (ValueError, IndexError):
                if "C" in string:                                #page numbers that are in the form of C<some number>
                    page_numbers_str.append(string)

        #filtering out back matters on the last page of a volume
        if int(current_page) == int(all_pages) and any(r in string for r in back_matter_ranges):
            page_numbers_str = page_numbers_str[:-1]


    #the year and the part (volume) of publication are the same for every paper in this document
    year_of_pub = [year] * len(titles)
    part_of_pub = [part] * len(titles)
        
    #creating the column names and content for the dataframe        
    data = {'Title': titles,
        'Authors': authors_str,
        'Page numbers' : page_numbers_str,
        'DOI': doi_str,
        'Year of publication' : year_of_pub,
        'Part of publication' : part_of_pub       }

    df = pd.DataFrame(data)

    return df 
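
The page-range rule used in `mining` can be isolated into a small helper that is easy to test on its own. This is a sketch under the assumption that a paper spans more than two pages or carries a correction-style range such as `C1-C1`; the name `is_paper_page_range` is hypothetical.

```python
def is_paper_page_range(page_range):
    """Return True if a page-range string looks like a paper rather than front or back matter."""
    start, _, end = page_range.strip().partition("-")
    try:
        # Papers span more than two pages; single-page entries are rejected
        return int(end) - int(start) > 1
    except ValueError:
        # Non-numeric ranges: keep correction-style ranges such as 'C1-C1',
        # drop roman-numeral front matter such as 'i-xxii'
        return "C" in page_range

samples = ["618-627", "141-151", "463-463", "i-xxii", "C1-C1"]
results = {sample: is_paper_page_range(sample) for sample in samples}
print(results)
```

The helper compares numeric values instead of testing substrings, because a substring test such as `'1-1' in '141-151'` evaluates to `True` and would reject a valid paper. Ranges of the form `C1-C1` are kept here and removed later in `clean_df`.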

In [89]:

def clean_df(df):    
    # Sorting by 'Title' and resetting the index
    df.sort_values(by='Title', inplace=True)
    df.reset_index(drop=True, inplace=True)

    # Adding 'paper_id' as the first column
    df.insert(0, 'paper_id', range(1, len(df) + 1))

    # Rename/refine the columns of the dataframe to make it more accessible for further use and analysis
    df.rename(columns={'Title': 'title', 'Part of publication': 'vol_number', 'Authors': 'authors', 
                   'DOI': 'doi', 'Year of publication': 'publication_year','Page numbers': 'page_numbers'}, inplace=True)   

    # Filtering out rows where 'page_numbers' is 'C1-C1' (copy so later assignments do not target a view)
    df = df[df['page_numbers'] != 'C1-C1'].copy()

    # Direct conversion to ensure 'vol_number' is treated as an integer
    df['vol_number'] = pd.to_numeric(df['vol_number'], errors='coerce').fillna(0).astype(int)

    # Counting the number of papers in each volume
    vol_counts = df['vol_number'].value_counts()
    
    # Using a loop to print the number of papers by volume, safely handling missing volumes
    for vol in range(1, 11):
        # Use .get(vol, default_value) to safely access the count for each volume
        print(f"Number of papers in Volume {vol}:", vol_counts.get(vol, 0))


    # Printing the total number of papers
    print('Total number of papers:', len(df))

    return df

In [90]:
#helper function to combine all df together
def data_together(data, year):
    combined_frame = pd.concat(data, ignore_index = True, sort = False)
    combined_frame_refined = clean_df(combined_frame)
    combined_frame_refined.to_csv('/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/databases/database_miccai_'+ str(year) +'.csv')
   
    return combined_frame_refined

In [79]:
"""
miccai 2023 papers : 10 volumes in total 
base path to the folder with stored HTML docs:
"""


miccai =[
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 01 page 1 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 01 page 2 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 01 page 3 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 01 page 4 of 4.doc',

    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 02 page 1 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 02 page 2 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 02 page 3 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 02 page 4 of 4.doc',

    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 03 page 1 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 03 page 2 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 03 page 3 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 03 page 4 of 4.doc',

    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 04 page 1 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 04 page 2 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 04 page 3 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 04 page 4 of 4.doc',

    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 05 page 1 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 05 page 2 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 05 page 3 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 05 page 4 of 4.doc',

    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 06 page 1 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 06 page 2 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 06 page 3 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 06 page 4 of 4.doc',

    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 07 page 1 of 5.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 07 page 2 of 5.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 07 page 3 of 5.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 07 page 4 of 5.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 07 page 5 of 5.doc',

    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 08 page 1 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 08 page 2 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 08 page 3 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 08 page 4 of 4.doc',

    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 09 page 1 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 09 page 2 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 09 page 3 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 09 page 4 of 4.doc',

    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 10 page 1 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 10 page 2 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 10 page 3 of 4.doc',
    '/Users/yasminsarkhosh/Documents/miccai2023 papers/miccai2023 vol 10 page 4 of 4.doc',     
    ]
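
The same 41 paths could also be generated from the number of saved pages per volume instead of being typed out. The sketch below assumes the naming convention shown in the list above; `PAGES_PER_VOLUME` and `build_paths` are hypothetical names.

```python
from pathlib import PurePosixPath

BASE_DIR = PurePosixPath("/Users/yasminsarkhosh/Documents/miccai2023 papers")

# Every volume has 4 saved pages, except volume 7, which has 5
PAGES_PER_VOLUME = {volume: 4 for volume in range(1, 11)}
PAGES_PER_VOLUME[7] = 5

def build_paths(base_dir, pages_per_volume):
    """Build one path per saved page, ordered by volume and then by page."""
    return [
        str(base_dir / f"miccai2023 vol {volume:02d} page {page} of {total}.doc")
        for volume, total in sorted(pages_per_volume.items())
        for page in range(1, total + 1)
    ]

generated = build_paths(BASE_DIR, PAGES_PER_VOLUME)
print(len(generated), generated[0])
```

Generating the list keeps the page counts in one place, so adding a volume or a page only requires changing the dictionary.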

In [88]:
data = []
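for element in miccai:
    # file names end in 'vol VV page P of N.doc': element[-10] is the current page P,
    # element[-5] is the page count N, and element[-18:-16] is the two-digit volume VV
    data.append(mining(element, 2023, element[-10], element[-5], element[-18:-16]))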

data_together(data, 2023)

Number of papers in Volume 1: 73
Number of papers in Volume 2: 73
Number of papers in Volume 3: 72
Number of papers in Volume 4: 75
Number of papers in Volume 5: 76
Number of papers in Volume 6: 77
Number of papers in Volume 7: 75
Number of papers in Volume 8: 65
Number of papers in Volume 9: 70
Number of papers in Volume 10: 74
Total number of papers: 730


| | paper_id | title | authors | page_numbers | doi | publication_year | vol_number |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3D Arterial Segmentation via Single 2D Project... | Alina F. Dima, Veronika A. Zimmer, Martin J. M... | 141-151 | 10.1007/978-3-031-43907-0_14 | 2023 | 1 |
| 1 | 2 | 3D Dental Mesh Segmentation Using Semantics-Ba... | Fan Duan, Li Chen | 456-465 | 10.1007/978-3-031-43990-2_43 | 2023 | 7 |
| 2 | 3 | 3D Medical Image Segmentation with Sparse Anno... | Heng Cai, Lei Qi, Qian Yu, Yinghuan Shi, Yang Gao | 614-624 | 10.1007/978-3-031-43898-1_59 | 2023 | 3 |
| 3 | 4 | 3D Mitochondria Instance Segmentation with Spa... | Omkar Thawakar, Rao Muhammad Anwer, Jorma Laak... | 613-623 | 10.1007/978-3-031-43993-3_59 | 2023 | 8 |
| 4 | 5 | 3D Teeth Reconstruction from Panoramic Radiogr... | Sihwa Park, Seongjun Kim, In-Seok Song, Seung ... | 376-386 | 10.1007/978-3-031-43999-5_36 | 2023 | 10 |
| 5 | 6 | A Closed-Form Solution to Electromagnetic Sens... | Tiancheng Li, Yang Song, Peter Walker, Kai Pan... | 365-375 | 10.1007/978-3-031-43996-4_35 | 2023 | 9 |
| 6 | 7 | A Conditional Flow Variational Autoencoder for... | Haoran Dou, Nishant Ravikumar, Alejandro F. Fr... | 143-152 | 10.1007/978-3-031-43990-2_14 | 2023 | 7 |
| 7 | 8 | A Coupled-Mechanisms Modelling Framework for N... | Tiantian He, Elinor Thompson, Anna Schroder, N... | 459-469 | 10.1007/978-3-031-43993-3_45 | 2023 | 8 |
| 8 | 9 | A Denoised Mean Teacher for Domain Adaptive Po... | Alexander Bigalke, Mattias P. Heinrich | 666-676 | 10.1007/978-3-031-43999-5_63 | 2023 | 10 |
| 9 | 10 | A Flexible Framework for Simulating and Evalua... | Emma A. M. Stanley, Matthias Wilms, Nils D. Fo... | 489-499 | 10.1007/978-3-031-43895-0_46 | 2023 | 2 |
