# Splitting volumes of MICCAI 2023 papers into separate articles
***

This notebook splits the MICCAI 2023 proceedings into one PDF per paper. The conference webpage, https://conferences.miccai.org/2023/papers/, lists 730 papers, which are published across 10 PDF volumes: https://link.springer.com/book/10.1007/978-3-031-43907-0.

The 10 PDFs were downloaded into a folder and named after their volume number: '/Documents/bachelor thesis data /research papers/micca2023/part<number of volume>.pdf'

After some experiments, each PDF was manually processed in Adobe Reader by deleting the first 40 or more pages, which contain the preface and table of contents, so that the first page of each PDF is the first page of the first article. The last 4 pages of each PDF, which contain the author index, were also manually deleted.
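
The manual trimming described above could also be scripted. The following is a minimal sketch, not the procedure used in this notebook: `pages_to_keep` and `trim_pdf` are hypothetical helpers, and the defaults `front=40` and `back=4` are assumptions taken from the page counts mentioned above. The front matter may be longer in some volumes, so the values would need to be checked for each PDF. The sketch relies on PyMuPDF's `Document.select`, which keeps only the listed pages.

```python
def pages_to_keep(page_count, front=40, back=4):
    """Return the 0-based page indices left after dropping
    `front` leading pages and `back` trailing pages."""
    if front < 0 or back < 0:
        raise ValueError("front and back must be non-negative")
    if front + back >= page_count:
        raise ValueError("nothing would be left after trimming")
    return list(range(front, page_count - back))


def trim_pdf(src_path, dst_path, front=40, back=4):
    """Write a copy of src_path without front and back matter (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so pages_to_keep works without it

    doc = fitz.open(src_path)
    try:
        # Keep only the pages between the front matter and the author index
        doc.select(pages_to_keep(doc.page_count, front, back))
        doc.save(dst_path)
    finally:
        doc.close()
```

Writing to a separate `dst_path` leaves the downloaded volume untouched, so a wrong `front` or `back` value can be corrected by simply rerunning the function.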

The notebook is structured around three steps:

1. Abstract the process of extracting page numbers from PDFs into a function.
2. Generalize the splitting of PDFs into separate research papers into a single function.
3. Manage PDF paths and page numbers in a more dynamic way, possibly using dictionaries or lists to iterate through volumes.

In [1]:
import os
import fitz  # PyMuPDF
import PyPDF2

In [15]:
def get_page_numbers(pdf_path):
    """
    Extract start and end page numbers for each article in a given PDF.

    Parameters:
    - pdf_path: Path to the PDF file.

    Returns:
    - A list of tuples with (start_page, end_page) for each article.
    """
    doc = fitz.open(pdf_path)
    print("Number of pages:", doc.page_count)
    
    articles = []
    current_article_start = 0
    i = 0

    while i < len(doc) - 1:  # Iterate through the document
        page = doc.load_page(i)
        text = page.get_text()

        if "References" in text:
            # Look ahead to see if the next pages are still part of the references.
            # NOTE: is_references_end is not defined in this notebook and must be provided separately.
            while i < len(doc) - 1:
                next_page = doc.load_page(i+1)
                next_text = next_page.get_text()
                if not is_references_end(text, next_text):
                    i += 1
                    text = next_text
                else:
                    break
            articles.append((current_article_start, i))
            current_article_start = i + 1
        i += 1

    # Handle the last article, which runs to the final page (0-based, inclusive end)
    if current_article_start < len(doc):
        articles.append((current_article_start, len(doc) - 1))

    doc.close()
    print("Page numbers:", articles if articles else "No articles found")
    return articles
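
`get_page_numbers` calls `is_references_end`, which is not defined in the cells shown here. Below is a minimal sketch of what such a helper might look like. The heuristic is an assumption, not the original implementation: it treats the reference list as finished when the next page has a line starting with "Abstract", on the assumption that this marks the first page of a new paper.

```python
import re

# Assumption: the first page of each paper has a line starting with "Abstract"
ABSTRACT_LINE = re.compile(r"^\s*Abstract\b", re.MULTILINE)


def is_references_end(text, next_text):
    """Return True if the reference list open on the current page ends there.

    text: text of the current page (unused by this simple heuristic,
          kept only to match the call in get_page_numbers).
    next_text: text of the following page.
    """
    return bool(ABSTRACT_LINE.search(next_text))
```

A heuristic this simple can misfire, for example when a page inside a paper happens to contain a line beginning with "Abstract", so the detected ranges would still need a manual check.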


In [16]:
def split_pdfs(original_pdf_path, papers_pages, output_folder, volume):
    """
    Split the given PDF into separate PDF files based on provided page numbers,
    and save them into a volume-specific folder. Returns the total number of papers processed.

    Parameters:
    - original_pdf_path: Path to the original PDF file.
    - papers_pages: A list of tuples containing the start and end pages for each article.
    - output_folder: The folder where split PDFs will be saved.
    - volume: The volume number or identifier to create a specific folder for the volume.

    Returns:
    - int: The total number of papers processed.
    """
    volume_folder = os.path.join(output_folder, volume)
    os.makedirs(volume_folder, exist_ok=True)  # Create volume folder if it doesn't exist
    
    total_papers = 0  # Initialize counter for total papers processed
    
    with open(original_pdf_path, "rb") as infile:
        reader = PyPDF2.PdfReader(infile)
        
        for i, (start, end) in enumerate(papers_pages):
            writer = PyPDF2.PdfWriter()
            
            for page in range(start, end + 1):
                try:
                    writer.add_page(reader.pages[page])
                except IndexError:
                    print(f"Page {page} out of range for {original_pdf_path}.")
                    break
            
            output_pdf_path = os.path.join(volume_folder, f"paper_{i+1}.pdf")
            with open(output_pdf_path, "wb") as outfile:
                writer.write(outfile)
            total_papers += 1  # Increment the counter after processing each paper

    return total_papers

In [18]:
# Folder path to the 10 PDF volumes
miccai_pdf_path = '/Users/yasminsarkhosh/Documents/micca2023_volumes/'

# Paths to all 10 volumes: "vol1" -> ".../part01.pdf", ..., "vol10" -> ".../part10.pdf"
pdf_paths = {
    f"vol{n}": miccai_pdf_path + f"part{n:02d}.pdf"
    for n in range(1, 11)
}

In [13]:
""" 
Page numbers were off by at least 40 pages because of the preface and contents pages.

Solved:
- By manually deleting the 40+ pages leading up to the first paper
- Repeated for all 10 PDF volumes
- The last 4 pages, containing the author index, were also removed manually
"""

# Page ranges for all volumes, extracted and added to a dictionary
# (0-based page indices, both ends inclusive)
papers_pages = {
    "vol1": [(0, 9), (10, 20), (21, 31), (32, 42), (43, 53), (54, 63), (64, 73), (74, 83), (84, 94), (95, 105), (106, 115), 
            (116, 126), (127, 137), (138, 148), (149, 158), (159, 169), (170, 179), (180, 191), (192, 202), (203, 212), 
            (213, 223), (224, 234), (235, 244), (245, 255), (256, 266), (267, 275), (276, 286), (287, 296), (297, 306), 
            (307, 316), (317, 327), (328, 338), (339, 348), (349, 359), (360, 370), (371, 380), (381, 390), (391, 401), 
            (402, 412), (413, 421), (422, 431), (432, 442), (443, 453), (454, 463), (464, 473), (474, 482), (483, 493), 
            (494, 504), (505, 514), (515, 524), (525, 535), (536, 547), (548, 558), (559, 569), (570, 579), (580, 590), 
            (591, 601), (602, 611), (612, 622), (623, 633), (634, 645), (646, 657), (658, 668), (669, 678), (679, 689), 
            (690, 700), (701, 711), (712, 722), (723, 733), (734, 743), (744, 753), (754, 764), (765, 775)],

    "vol2": [(0, 10), (11, 21), (22, 31), (32, 42), (43, 53), (54, 64), (65, 75), (76, 86), (87, 97), (98, 108), 
            (109, 119), (120, 131), (132, 142), (143, 153), (154, 164), (165, 175), (176, 186), (187, 196), (197, 206), 
            (207, 216), (217, 227), (228, 238), (239, 249), (250, 260), (261, 270), (271, 271), (272, 281), (282, 291), 
            (292, 301), (302, 312), (313, 323), (324, 333), (334, 344), (345, 355), (356, 365), (366, 376), (377, 388), 
            (389, 397), (398, 406), (407, 419), (420, 429), (430, 440), (441, 451), (452, 462), (463, 473), (474, 483), 
            (484, 494), (495, 505), (506, 515), (516, 526), (527, 537), (538, 548), (549, 558), (559, 569), (570, 580), 
            (581, 590), (591, 601), (602, 612), (613, 622), (623, 633), (634, 643), (644, 653), (654, 665), (666, 675), 
            (676, 686), (687, 697), (698, 707), (708, 717), (718, 728), (729, 738), (739, 748), (749, 759), (760, 770),
            (771, 781)],

    "vol3": [(0, 10), (11, 21), (22, 32), (33, 43), (44, 53), (54, 62), (63, 73), (74, 83), (84, 93), (94, 104), (105, 114), 
            (115, 124), (125, 134), (135, 145), (146, 155), (156, 165), (166, 175), (176, 185), (186, 195), (196, 206), 
            (207, 217), (218, 228), (229, 238), (239, 248), (249, 258), (259, 269), (270, 280), (281, 291), (292, 301), 
            (302, 312), (313, 322), (323, 331), (332, 342), (343, 353), (354, 363), (364, 374), (375, 385), (386, 396), 
            (397, 407), (408, 418), (419, 428), (429, 438), (439, 449), (450, 460), (461, 470), (471, 481), (482, 492), 
            (493, 503), (504, 514), (515, 524), (525, 535), (536, 546), (547, 556), (557, 567), (568, 577), (578, 587), 
            (588, 598), (599, 609), (610, 620), (621, 632), (633, 643), (644, 654), (655, 665), (666, 676), (677, 687), 
            (688, 698), (699, 709), (710, 720), (721, 731), (732, 742), (743, 752), (753, 762)],

    "vol4": [(0, 10), (11, 20), (21, 31), (32, 42), (43, 52), (53, 63), (64, 74), (75, 85), (86, 96), (97, 106), (107, 116), 
            (117, 126), (127, 137), (138, 148), (149, 159), (160, 169), (170, 179), (180, 190), (191, 201), (202, 212), 
            (213, 223), (224, 234), (235, 245), (246, 256), (257, 267), (268, 277), (278, 288), (289, 298), (299, 308), 
            (309, 319), (320, 329), (330, 340), (341, 350), (351, 360), (361, 370), (371, 381), (382, 391), (392, 401), 
            (402, 412), (413, 423), (424, 433), (434, 444), (445, 455), (456, 466), (467, 477), (478, 487), (488, 498), 
            (499, 509), (510, 519), (520, 530), (531, 540), (541, 551), (552, 563), (564, 574), (575, 585), (586, 596), 
            (597, 607), (608, 618), (619, 628), (629, 638), (639, 648), (649, 658), (659, 669), (670, 678), (679, 688), 
            (689, 699), (700, 709), (710, 720), (721, 730), (731, 741), (742, 752), (753, 762), (763, 772), (773, 782), 
            (783, 792)], 


    "vol5": [(0, 9), (10, 19), (20, 29), (30, 40), (41, 51), (52, 59), (60, 68), (69, 79), (80, 90), (91, 102), (103, 112), 
            (113, 123), (124, 132), (133, 142), (143, 153), (154, 164), (165, 174), (175, 185), (186, 195), (196, 206), 
            (207, 216), (217, 226), (227, 237), (238, 248), (249, 258), (259, 268), (269, 279), (280, 289), (290, 300), 
            (301, 311), (312, 323), (324, 334), (335, 344), (345, 354), (355, 364), (365, 375), (376, 385), (386, 395), 
            (396, 405), (406, 416), (417, 426), (427, 437), (438, 448), (449, 458), (459, 468), (469, 478), (479, 489), 
            (490, 500), (501, 511), (512, 522), (523, 533), (534, 543), (544, 554), (555, 565), (566, 575), (576, 586), 
            (587, 597), (598, 607), (608, 618), (619, 628), (629, 638), (639, 648), (649, 659), (660, 670), (671, 681), 
            (682, 692), (693, 702), (703, 711), (712, 722), (723, 733), (734, 743), (744, 754), (755, 764), (765, 774), 
            (775, 785), (786, 802)], 

    "vol6": [(0, 10), (11, 20), (21, 30), (31, 40), (41, 50), (51, 60), (61, 71), (72, 81), (82, 91), (92, 102), (103, 112), 
            (113, 123), (124, 133), (134, 144), (145, 155), (156, 165), (166, 176), (177, 187), (188, 198), (199, 209), 
            (210, 219), (220, 229), (230, 239), (240, 249), (250, 259), (260, 269), (270, 280), (281, 291), (292, 302), 
            (303, 313), (314, 323), (324, 333), (334, 343), (344, 354), (355, 364), (365, 375), (376, 385), (386, 396), 
            (397, 407), (408, 417), (418, 429), (430, 440), (441, 451), (452, 461), (462, 471), (472, 481), (482, 491), 
            (492, 502), (503, 512), (513, 522), (523, 533), (534, 543), (544, 554), (555, 564), (565, 574), (575, 585), 
            (586, 596), (597, 606), (607, 616), (617, 626), (627, 636), (637, 646), (647, 656), (657, 667), (668, 677), 
            (678, 687), (688, 698), (699, 708), (709, 719), (720, 729), (730, 739), (740, 749), (750, 759), (760, 770), 
            (771, 780), (781, 791), (792, 802)], 


    "vol7":  [(0, 9), (10, 19), (20, 29), (30, 39), (40, 51), (52, 62), (63, 73), (74, 83), (84, 92), (93, 103), (104, 114), 
            (115, 125), (126, 136), (137, 146), (147, 156), (157, 166), (167, 177), (178, 188), (189, 199), (200, 209), 
            (210, 219), (220, 229), (230, 241), (242, 252), (253, 262), (263, 273), (274, 283), (284, 294), (295, 305), 
            (306, 315), (316, 326), (327, 337), (338, 348), (349, 359), (360, 370), (371, 381), (382, 391), (392, 401), 
            (402, 411), (412, 421), (422, 432), (433, 442), (443, 452), (453, 462), (463, 473), (474, 483), (484, 495), 
            (496, 505), (506, 515), (516, 525), (526, 535), (536, 546), (547, 556), (557, 567), (568, 578), (579, 589), 
            (590, 600), (601, 611), (612, 621), (622, 631), (632, 642), (643, 653), (654, 664), (665, 675), (676, 686), 
            (687, 696), (697, 706), (707, 717), (718, 727), (728, 739), (740, 750), (751, 761), (762, 771), (772, 782), 
            (783, 792)], 

    "vol8": [(0, 10), (11, 21), (22, 31), (32, 42), (43, 52), (53, 63), (64, 73), (74, 84), (85, 95), (96, 105), (106, 116), 
            (117, 127), (128, 138), (139, 148), (149, 159), (160, 169), (170, 180), (181, 191), (192, 201), (202, 211), 
            (212, 222), (223, 233), (234, 243), (244, 254), (255, 264), (265, 273), (274, 283), (284, 294), (295, 304), 
            (305, 314), (315, 324), (325, 334), (335, 344), (345, 354), (355, 365), (366, 376), (377, 385), (386, 395), 
            (396, 405), (406, 416), (417, 425), (426, 435), (436, 445), (446, 455), (456, 466), (467, 478), (479, 488), 
            (489, 498), (499, 509), (510, 520), (521, 532), (533, 543), (544, 554), (555, 564), (565, 575), (576, 587), 
            (588, 598), (599, 608), (609, 619), (620, 630), (631, 641), (642, 651), (652, 661), (662, 671), (672, 681)], 

    "vol9": [(0, 9), (10, 20), (21, 31), (32, 42), (43, 53), (54, 64), (65, 75), (76, 86), (87, 97), (98, 108), (109, 119), 
            (120, 129), (130, 140), (141, 150), (151, 161), (162, 172), (173, 182), (183, 192), (193, 203), (204, 213), 
            (214, 223), (224, 234), (235, 245), (246, 256), (257, 267), (268, 277), (278, 287), (288, 298), (299, 308), 
            (309, 318), (319, 329), (330, 340), (341, 350), (351, 361), (362, 372), (373, 382), (383, 393), (394, 404), 
            (405, 414), (415, 425), (426, 436), (437, 447), (448, 458), (459, 468), (469, 479), (480, 490), (491, 501), 
            (502, 511), (512, 521), (522, 531), (532, 541), (542, 551), (552, 562), (563, 572), (573, 583), (584, 593), 
            (594, 603), (604, 614), (615, 624), (625, 633), (634, 643), (644, 654), (655, 664), (665, 675), (676, 685), 
            (686, 695), (696, 704), (705, 714), (715, 724), (725, 735)],

    "vol10": [(0, 9), (10, 20), (21, 30), (31, 41), (42, 52), (53, 62), (63, 73), (74, 84), (85, 94), (95, 105), (106, 117), 
            (118, 128), (129, 138),(139, 149), (150, 159), (160, 169), (170, 180), (181, 191), (192, 202), (203, 213), 
            (214, 224), (225, 235), (236, 246), (247, 256), (257, 267),(268, 278), (279, 289), (290, 299), (300, 309), 
            (310, 319), (320, 329), (330, 340), (341, 351), (352, 362), (363, 372), (373, 383), (384, 394), (395, 405), 
            (406, 415), (416, 424), (425, 434), (435, 444), (445, 455), (456, 466), (467, 477), (478, 487), (488, 498), 
            (499, 509), (510, 520), (521, 532), (533, 543), (544, 554), (555, 565), (566, 576), (577, 587), (588, 597), 
            (598, 608), (609, 618), (619, 629), (630, 640), (641, 650), (651, 661), (662, 672), (673, 683), (684, 694), 
            (695, 705), (706, 715), (716, 725), (726, 735), (736, 745), (746, 756), (757, 766), (767, 776), (777, 786)]   
}
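
Because the ranges above are long and partly hand-corrected, a quick consistency check before splitting can catch gaps, overlaps and suspiciously short papers. The following is a minimal sketch; `check_page_ranges` is a hypothetical helper, not part of the original notebook, and it assumes the ranges are 0-based, inclusive and meant to be contiguous.

```python
def check_page_ranges(ranges, page_count=None):
    """Return a list of human-readable problems found in (start, end) page ranges."""
    problems = []
    expected_start = 0
    for n, (start, end) in enumerate(ranges, start=1):
        if start > end:
            problems.append(f"paper {n}: start {start} is after end {end}")
        if start != expected_start:
            # A gap or an overlap with the previous paper
            problems.append(f"paper {n}: starts at {start}, expected {expected_start}")
        if start == end:
            problems.append(f"paper {n}: only one page ({start})")
        expected_start = end + 1
    # Optionally check that the last paper ends on the last page of the PDF
    if page_count is not None and ranges and ranges[-1][1] != page_count - 1:
        problems.append(f"last paper ends at {ranges[-1][1]}, expected {page_count - 1}")
    return problems
```

It could be run as `check_page_ranges(papers_pages["vol2"])`. With the data above, the `(271, 271)` entry in `vol2` would be reported as a one-page paper.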

**After splitting, the total number of papers for each volume should be:**

| Volume | Papers |
|---|---|
| 1 | 73 |
| 2 | 74 |
| 3 | 72 |
| 4 | 75 |
| 5 | 76 |
| 6 | 77 |
| 7 | 75 |
| 8 | 65 |
| 9 | 70 |
| 10 | 74 |
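
These expected counts can be compared against the page ranges before any file is written. The following is a minimal sketch; `expected_counts` copies the numbers above, and `count_mismatches` is a hypothetical helper that is not part of the original notebook.

```python
# Expected number of papers per volume, copied from the counts above
expected_counts = {
    "vol1": 73, "vol2": 74, "vol3": 72, "vol4": 75, "vol5": 76,
    "vol6": 77, "vol7": 75, "vol8": 65, "vol9": 70, "vol10": 74,
}


def count_mismatches(papers_pages, expected):
    """Return {volume: (found, expected)} for every volume whose count differs."""
    mismatches = {}
    for vol, n_expected in expected.items():
        n_found = len(papers_pages.get(vol, []))  # a missing volume counts as 0 papers
        if n_found != n_expected:
            mismatches[vol] = (n_found, n_expected)
    return mismatches
```

Calling `count_mismatches(papers_pages, expected_counts)` returns an empty dictionary when every volume has the expected number of ranges.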

In [17]:
# Store the split PDFs in the output folder
output_folder = "/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023_papers/"

"""
A few PDFs were not split properly:
- Some missed a page, usually the last page of References or the first page with Title and Abstract
- Some papers started with the previous paper's last page of References
- One paper was split into two separate PDFs

All PDFs were manually corrected
"""

for vol, pdf_path in pdf_paths.items():
    print(f"Processing {vol}...")
    pages = papers_pages.get(vol)
    if pages:
        total_papers = split_pdfs(pdf_path, pages, output_folder, vol)
        print(f"Total papers processed for {vol}: {total_papers}")

Processing vol1...
Total papers processed for vol1: 73
Processing vol2...
Total papers processed for vol2: 74
Processing vol3...
Total papers processed for vol3: 72
Processing vol4...
Total papers processed for vol4: 75
Processing vol5...
Total papers processed for vol5: 76
Processing vol6...
Total papers processed for vol6: 77
Processing vol7...
Total papers processed for vol7: 75
Processing vol8...
Total papers processed for vol8: 65
Processing vol9...
Total papers processed for vol9: 70
Processing vol10...
Total papers processed for vol10: 74
