***
## Libraries
***

In [153]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import re
import fitz  # PyMuPDF
import uuid # for generating unique identifiers for each paper

In [None]:
%%capture
# Setup and installation of the required packages
#!pip install spacy nltk PyMuPDF
#!python -m spacy download en_core_web_sm

***
## Important paths
***

In [246]:
# Base path to folder where output files will be stored
output_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/finals'

# Base path to folders 
base_path = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/'

# Path to the MICCAI 2023 pdfs
pdf_path = base_path + 'miccai_2023/'

# Path to the MICCAI 2023 database of all 730 papers and their metadata
database_path = base_path + 'databases/'
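
Because output_path has no trailing separator, plain string concatenation such as output_path + 'agg_results' + '.csv' yields a file name that starts with "finals". This is why later cells read files such as finalsextracted_sents_keywords.csv. Below is a minimal sketch of joining the parts with an explicit separator, assuming a POSIX-style layout. The directory string and the helper build_csv_path are placeholders for illustration, not part of the notebook's setup.

```python
from pathlib import PurePosixPath

# Placeholder directory; stands in for output_path above (no trailing slash)
example_output_dir = 'thesis/code/finals'

def build_csv_path(directory, title):
    # Hypothetical helper: joins directory and file name with exactly one separator
    return str(PurePosixPath(directory) / f'{title}.csv')

# Plain concatenation glues the file name onto the last folder name
concatenated = example_output_dir + 'agg_results' + '.csv'
# Joining the parts keeps the folder and the file name separate
joined = build_csv_path(example_output_dir, 'agg_results')

print(concatenated)  # thesis/code/finalsagg_results.csv
print(joined)        # thesis/code/finals/agg_results.csv
```

Switching to joined paths would change where existing files are looked up, so the names already on disk would have to be adjusted to match.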

In [None]:
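def save_to_csv(df, path, title):
    # path and title are concatenated as-is, so path must end with a separator
    # if the file should be placed inside that folder
    df.to_csv(path + title + '.csv', index=True)

def read_csv_file(path, filename, var_name=None):
    # var_name is unused and only kept so that existing calls keep working
    return pd.read_csv(path + filename + '.csv')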

***
### Dataframe 1: MICCAI 2023
***

In [247]:
df_miccai = pd.read_csv(database_path +'updated_database_miccai_2023.csv', index_col=[0], header=[0], encoding='utf-8')
df_miccai

Unnamed: 0,Title,Authors,Page numbers,DOI,Year of publication,Part of publication
0,PET-Diffusion: Unsupervised PET Enhancement Ba...,"Caiwen Jiang, Yongsheng Pan, Mianxin Liu, Lei ...",3-12,10.1007/978-3-031-43907-0_1,2023,1
1,MedIM: Boost Medical Image Representation via ...,"Yutong Xie, Lin Gu, Tatsuya Harada, Jianpeng Z...",13-23,10.1007/978-3-031-43907-0_2,2023,1
2,UOD: Universal One-Shot Detection of Anatomica...,"Heqin Zhu, Quan Quan, Qingsong Yao, Zaiyi Liu,...",24-34,10.1007/978-3-031-43907-0_3,2023,1
3,S2^2ME: Spatial-Spectral Mutual Teaching and E...,"An Wang, Mengya Xu, Yang Zhang, Mobarakol Isla...",35-45,10.1007/978-3-031-43907-0_4,2023,1
4,Modularity-Constrained Dynamic Representation ...,"Qianqian Wang, Mengqi Wu, Yuqi Fang, Wei Wang,...",46-56,10.1007/978-3-031-43907-0_5,2023,1
...,...,...,...,...,...,...
726,ModeT: Learning Deformable Image Registration ...,"Haiqiao Wang, Dong Ni, Yi Wang",740-749,10.1007/978-3-031-43999-5_70,2023,10
727,Non-iterative Coarse-to-Fine Transformer Netwo...,"Mingyuan Meng, Lei Bi, Michael Fulham, Dagan F...",750-760,10.1007/978-3-031-43999-5_71,2023,10
728,DISA: DIfferentiable Similarity Approximation ...,"Matteo Ronchetti, Wolfgang Wein, Nassir Navab,...",761-770,10.1007/978-3-031-43999-5_72,2023,10
729,StructuRegNet: Structure-Guided Multimodal 2D-...,"Amaury Leroy, Alexandre Cafaro, Grégoire Gessa...",771-780,10.1007/978-3-031-43999-5_73,2023,10


In [248]:
df_miccai.info()

# 730 entries; index labels run from 0 to 730, so one label is missing
# 6 columns in total:
# Title, Authors, Page numbers, DOI, Year of publication, Part of publication
# dtypes: int64(2), object(4)

<class 'pandas.core.frame.DataFrame'>
Index: 730 entries, 0 to 730
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Title                730 non-null    object
 1   Authors              730 non-null    object
 2   Page numbers         730 non-null    object
 3   DOI                  730 non-null    object
 4   Year of publication  730 non-null    int64 
 5   Part of publication  730 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 39.9+ KB


MICCAI 2023 contains 730 papers in total, and the dataframe has 730 entries. Its index, however, runs from 0 to 730, which means one index label is missing. As a first check, I count the number of papers in each publication part (Part of publication).

In [249]:
# Count the number of papers in each publication part. There are 10 parts in total.
part_counts = df_miccai['Part of publication'].value_counts()

for part in range(1, 11):
    print(f'Number of papers in Publication {part}:', part_counts[part])

# Count the total number of papers in the dataframe
print('Total number of papers:', part_counts.sum())  # 730

Number of papers in Publication 1: 73
Number of papers in Publication 2: 73
Number of papers in Publication 3: 72
Number of papers in Publication 4: 75
Number of papers in Publication 5: 76
Number of papers in Publication 6: 77
Number of papers in Publication 7: 75
Number of papers in Publication 8: 65
Number of papers in Publication 9: 70
Number of papers in Publication 10: 74
Total number of papers: 730
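
The info() output above reports 730 entries with index labels from 0 to 730, so one label in that range is absent. A minimal sketch of how the missing label could be located is shown below. It uses a small stand-in dataframe (toy, hypothetical data) instead of df_miccai.

```python
import pandas as pd

# Toy stand-in for df_miccai: index labels 0..5 with label 3 missing (hypothetical data)
toy = pd.DataFrame({'Title': list('abcde')}, index=[0, 1, 2, 4, 5])

# Every label that a gap-free index over the same range would contain
expected = pd.RangeIndex(int(toy.index.min()), int(toy.index.max()) + 1)

# Labels that are expected but not present in the dataframe
missing_labels = expected.difference(toy.index)
print(list(missing_labels))
```

Applied to df_miccai, the same two lines would list the label that was dropped, which helps decide whether a paper is missing or the index was simply not reset.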


***
## **Selecting a scope of papers from MICCAI 2023**
***
***

**Scope criteria:** Select papers that address cancer-related illnesses by searching the text of each research paper for cancer-related keywords. The searched text runs from the start of the Abstract to the last line of the Conclusion, excluding the title of the paper, the authors and affiliations, the Acknowledgements and the References.

Cancer-related keywords could be words such as 'cancer', 'tumor' and/or 'tumours'.
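
The spelling variants named above ('tumor', 'tumours', and so on) can be covered by one regular expression instead of a list of separate strings. The sketch below is one possible pattern, not the one used in this notebook. The names CANCER_PATTERN and mentions_cancer are introduced here for illustration only.

```python
import re

# Hypothetical pattern: 'cancer' plus suffixes (cancers, cancerous),
# and tumor/tumour in singular or plural, matched as whole words
CANCER_PATTERN = re.compile(r'\b(?:cancer\w*|tumou?rs?)\b', re.IGNORECASE)

def mentions_cancer(text):
    # True if at least one cancer-related keyword occurs in the text
    return CANCER_PATTERN.search(text) is not None

print(mentions_cancer("Brain tumour segmentation"))
print(mentions_cancer("A rumor about humor"))
```

Note that the word boundary makes this stricter than a plain substring test: a compound such as "anticancer" would not match, so the pattern would need extending if such terms should count.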


In [250]:
# Function to extract the full text from the PDF
def extract_text(pdf_path):
    with fitz.open(pdf_path) as doc:
        full_text = ""
        for page in doc:
            full_text += page.get_text()
    return full_text

# Function to check whether any keyword appears in the section between Abstract and Conclusion
def find_keywords_section(full_text, keywords):
    # Lowercase once so that every search below is case-insensitive
    lowered = full_text.lower()

    # Regular expression to find the end of the affiliations section
    affiliations_end = re.search(r'\d{1,2}\s+(?:\w+\.)+@\w+\.\w{2,}', full_text)

    # Start searching from the end of affiliations if found, otherwise from the start of the text
    start_idx = affiliations_end.end() if affiliations_end else 0

    # Look for the Abstract and Conclusion sections
    abstract_idx = lowered.find("abstract", start_idx)
    conclusion_idx = lowered.rfind("conclusion", abstract_idx)
    acknowledgements_idx = lowered.find("acknowledgements", conclusion_idx)

    # Stop at Acknowledgements if it exists, otherwise at the Conclusion index
    end_search_idx = acknowledgements_idx if acknowledgements_idx != -1 else conclusion_idx

    # If neither Abstract nor Conclusion is found, search the entire text
    if abstract_idx == -1 and conclusion_idx == -1:
        searchable_text = lowered[start_idx:]
    else:
        # Search from Abstract to Conclusion or Acknowledgements
        searchable_text = lowered[abstract_idx:end_search_idx]

    # Return True as soon as any keyword occurs in the section
    return any(keyword.lower() in searchable_text for keyword in keywords)

# Function to extract the title from the PDF
def extract_title(pdf_path):
    with fitz.open(pdf_path) as doc:
        first_page_text = doc[0].get_text("text")
        
        # Regular expression to find the start of affiliations or author names
        # Looks for sequences in author lists or affiliations, such as numbers and parentheses
        author_or_affiliations_start = re.search(r'\b[A-Z][a-z]+ [A-Z]\.|\b[A-Z][a-z]+\s[A-Z][a-z]+[1-9]', first_page_text)

        title = ""
        if author_or_affiliations_start:
            # Extract text up to the start of the author list or affiliations as potential title text
            potential_title_text = first_page_text[:author_or_affiliations_start.start()].strip()
            title_lines = potential_title_text.split('\n')
            
            # The title is expected to be a continuous block of text at the top of the page,
            # possibly after a journal header or similar: look for a large continuous block of text.
            for line in reversed(title_lines):
                if line.strip():  
                    # Prepend to keep the title in the correct order
                    title = line + " " + title
                else:
                    # An empty line might indicate the end of the title block
                    break
        else:
            # If no author list or affiliation section is identified, use the first non-empty line
            for line in first_page_text.split('\n'):
                if line.strip():
                    title = line
                    break

        title = title.strip()  # Clean up whitespace
        return title

selected_papers = []
titles = []

# List of keywords to search for
keywords = ["cancer"]

# Iterate over each volume and search for keywords
for i in range(1, 11):  # Volumes 1 to 10
    folder_name = f"miccai23vol{i}"
    folder_path = os.path.join(pdf_path, folder_name)
    
    for pdf in os.listdir(folder_path):
        if pdf.endswith(".pdf"):
            pdf_path_ = os.path.join(folder_path, pdf)
            full_text = extract_text(pdf_path_)
            if find_keywords_section(full_text, keywords):
                selected_papers.append(os.path.join(folder_name, pdf))

# Extract titles from selected papers
for paper_path in selected_papers:
    full_paper_path = os.path.join(pdf_path, paper_path)
    title = extract_title(full_paper_path)
    titles.append(title)

print(f"Extracted titles from {len(titles)} selected papers.")
print(f"With the keyword(s) being {keywords}, {len(selected_papers)} papers were selected as relevant to cancer research.")

Extracted titles from 189 selected papers.
With the keyword(s) being ['cancer'], 189 papers were selected as relevant to cancer research.
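
The section search above looks for the literal heading "acknowledgements", so papers that spell it "Acknowledgment" or "Acknowledgement" fall back to the Conclusion index. The sketch below shows one way to cut the body text that accepts these spellings and otherwise stops at the References heading. It is an illustrative alternative under those assumptions, not the logic that produced the 189 papers above. slice_body and both sample strings are hypothetical.

```python
import re

def slice_body(full_text):
    # Work on a lowercased copy so headings match regardless of case
    lowered = full_text.lower()

    # Start at the Abstract heading, or at the beginning if it is not found
    start = max(lowered.find('abstract'), 0)

    # Accept acknowledgment(s) and acknowledgement(s) as the end marker
    ack = re.search(r'acknowledge?ments?', lowered[start:])
    if ack:
        end = start + ack.start()
    else:
        # Otherwise stop at the last References heading, or at the end of the text
        ref = lowered.rfind('references')
        end = ref if ref > start else len(lowered)

    return lowered[start:end]

sample = ("Title\nAbstract. We study cancer.\n5 Conclusion\nDone.\n"
          "Acknowledgment. Thanks to the cancer fund.\nReferences\n1. X")
print(slice_body(sample))
```

The heuristic still has limits: the word "acknowledgement" can also occur in running text, in which case the slice would end too early.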


In [251]:
# Save the selected papers and their paths to a CSV file
selected_papers_path = pd.DataFrame({"Path": selected_papers, "Title": titles})
#selected_papers_path.to_csv(output_path + 'papers_by_paths_titles.csv', index=False)

Unnamed: 0,Path,Title
0,miccai23vol1/paper_29.pdf,Geometry-Invariant Abnormality
1,miccai23vol1/paper_14.pdf,3D Arterial Segmentation via Single 2D Project...
2,miccai23vol1/paper_11.pdf,TPRO: Text-Prompting-Based Weakly Supervised H...
3,miccai23vol1/paper_58.pdf,vox2vec: A Framework for Self-supervised Contr...
4,miccai23vol1/paper_64.pdf,Pick the Best Pre-trained Model: Towards Trans...
...,...,...
184,miccai23vol10/paper_50.pdf,Solving Low-Dose CT Reconstruction
185,miccai23vol10/paper_46.pdf,Noise2Aliasing: Unsupervised Deep Learning for...
186,miccai23vol10/paper_3.pdf,Revealing Anatomical Structures in PET to Gene...
187,miccai23vol10/paper_21.pdf,Geometric Ultrasound Localization


In [252]:
# Store the full path of each selected paper (each entry is a one-element list)
selected_papers_paths = [[base_path + 'miccai_2023/' + paper] for paper in selected_papers]

# Check that the number of paths equals the number of selected papers
assert len(selected_papers_paths) == len(selected_papers)
selected_papers_paths[:5]

[['/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol1/paper_29.pdf'],
 ['/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol1/paper_14.pdf'],
 ['/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol1/paper_11.pdf'],
 ['/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol1/paper_58.pdf'],
 ['/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/miccai_2023/miccai23vol1/paper_64.pdf']]

In [253]:
import spacy
nlp = spacy.load("en_core_web_sm")

# Function to extract the searchable text (Abstract to Conclusion/Acknowledgements) from the PDF.
# Note: this redefines extract_text from the earlier cell, which returned the full text.
def extract_text(pdf_path):
    # Read all pages first; the section boundaries are computed once on the full text
    with fitz.open(pdf_path) as doc:
        full_text = ""
        for page in doc:
            full_text += page.get_text()

    # Regular expression to find the end of the affiliations section
    affiliations_end = re.search(r'\d{1,2}\s+(?:\w+\.)+@\w+\.\w{2,}', full_text)

    # Start searching from the end of affiliations if found, otherwise from the start of the text
    start_idx = affiliations_end.end() if affiliations_end else 0

    # Look for the Abstract and Conclusion sections
    abstract_idx = full_text.lower().find("abstract", start_idx)
    conclusion_idx = full_text.lower().rfind("conclusion", abstract_idx)
    acknowledgements_idx = full_text.lower().find("acknowledgements", conclusion_idx)

    # Stop at Acknowledgements if it exists, otherwise at the Conclusion index
    end_search_idx = acknowledgements_idx if acknowledgements_idx != -1 else conclusion_idx

    # If neither Abstract nor Conclusion is found, search the entire text
    if abstract_idx == -1 and conclusion_idx == -1:
        searchable_text = full_text[start_idx:]
    else:
        # Search from Abstract to Conclusion or Acknowledgements
        searchable_text = full_text[abstract_idx:end_search_idx].lower()

    return searchable_text

def extract_relevant_sentences(text, keywords):
    relevant_sentences = []
    doc = nlp(text)
    # Regex pattern that matches whole words from the keywords list, case insensitive
    pattern = r'\b(' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b'
    for sent in doc.sents:
        if re.search(pattern, sent.text, re.IGNORECASE):
            relevant_sentences.append(sent.text.strip())
    return relevant_sentences


In [254]:
# Extract the keyword sentences of each paper and return them as a long-format DataFrame
def extract_sents_by_keywords(df, keywords, col_title):
    # df is a list of one-element lists, each holding the file path of one PDF
    extracted_sents = {}

    for pdf_path in df:
        path = pdf_path[0]
        text = extract_text(path)
        relevant_sentences = extract_relevant_sentences(text, keywords)

        # Keep papers without any match as a single entry with a missing sentence
        extracted_sents[path] = relevant_sentences if relevant_sentences else [None]

    # Convert to DataFrame: one row per (paper, sentence) pair
    rows = []
    for paper_id, sentences in extracted_sents.items():
        for sentence in sentences:
            rows.append({'paper_id': paper_id, col_title: sentence})

    extracted_sents_df = pd.DataFrame(rows)
    return extracted_sents_df

In [255]:
# keywords = ["cancer"]

# keywords = ['age', 'gender', 'sex', 'women', 'woman', 'female', 'men', 'man', 'male',
#             'geolocation', 'geographical', 'geographic', 'country', 'countries', 'city', 'cities', 'hospital', 'hospitals', 'clinic', 'clinics', 
#             'society', 'societies',
#             'etnicity', 'etnicities', 'race', 
#             'bias', 'biases', 'fair', 'fairness', 'transparency']


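# Note: 'etnicity' and 'etnicities' are matched exactly as spelled here,
# so text that uses the spelling 'ethnicity' is not captured by these entries
keywords = ['age', 'gender', 'sex', 'women', 'woman', 'female', 'male',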
            'geolocation', 'geographical', 'geographic', 'country', 'countries', 'city', 'cities', 'hospital', 'hospitals', 'clinic', 'clinics', 
            'society', 'societies',
            'etnicity', 'etnicities', 'race', 
            'bias', 'biases', 'fair', 'unfair', 'fairness', 'transparency',
            'imbalance', 'imbalanced', 'balance', 'balanced']

In [None]:
# Extract sentences by keywords
# sents_by_cancer = extract_sents_by_keywords(selected_papers_paths, keywords, 'extracted_sents_cancer')
# sents_by_cancer.to_csv(output_path + 'extracted_sents_cancer.csv', index=False)

In [256]:
# Extract sentences by keywords
sents_by_keywords = extract_sents_by_keywords(selected_papers_paths, keywords, col_title='extracted_sents_keywords')
# Save to CSV
#sents_by_keywords.to_csv(output_path + 'extracted_sents_keywords.csv', index=False)

In [260]:
filename = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/finalsextracted_sents_keywords.csv'

extracted_sents = pd.read_csv(filename)
extracted_sents.fillna('None', inplace=True)

# Derive the short relative path (volume folder + file name) from the full path stored in paper_id
extracted_sents['path'] = extracted_sents['paper_id'].str.split('/').apply(lambda x: '/'.join(x[-2:]))
extracted_sents


Unnamed: 0,paper_id,extracted_sents_keywords,path
0,/Users/yasminsarkhosh/Documents/GitHub/machine...,,miccai23vol1/paper_29.pdf
1,/Users/yasminsarkhosh/Documents/GitHub/machine...,the cohort consists of 141 patients with pancr...,miccai23vol1/paper_14.pdf
2,/Users/yasminsarkhosh/Documents/GitHub/machine...,we distinguish between models selected accordi...,miccai23vol1/paper_14.pdf
3,/Users/yasminsarkhosh/Documents/GitHub/machine...,,miccai23vol1/paper_11.pdf
4,/Users/yasminsarkhosh/Documents/GitHub/machine...,,miccai23vol1/paper_58.pdf
...,...,...,...
656,/Users/yasminsarkhosh/Documents/GitHub/machine...,since pseudo ct is\nconvenient to be integrate...,miccai23vol10/paper_3.pdf
657,/Users/yasminsarkhosh/Documents/GitHub/machine...,"for a fair comparison, we implemented these me...",miccai23vol10/paper_3.pdf
658,/Users/yasminsarkhosh/Documents/GitHub/machine...,such inconsis-\ntent metrics suggest that the ...,miccai23vol10/paper_3.pdf
659,/Users/yasminsarkhosh/Documents/GitHub/machine...,,miccai23vol10/paper_21.pdf


In [261]:
selected_papers_path.rename(columns={'Title': 'title', 'Path': 'path'}, inplace=True)
selected_papers_path.sort_values(by='title', inplace=True)
selected_papers_path.reset_index(drop=True, inplace=True)
selected_papers_path['paper_id'] = range(1, len(selected_papers_path) + 1)
selected_papers_path.to_csv(output_path + 'selected_papers_path_paper_id.csv', index=False)
selected_papers_df = pd.read_csv('/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/database_analysis_output/outputs/selected_papers_final.csv')
selected_papers_df['paper_id'] = range(1, len(selected_papers_df) + 1)
selected_papers_df
#selected_papers_df.to_csv(output_path + 'selected_papers_paper_id.csv', index=False)

Unnamed: 0,title,authors,page_numbers,doi,publication_year,vol_number,paper_id
0,3D Arterial Segmentation via Single 2D Project...,"Alina F. Dima, Veronika A. Zimmer, Martin J. M...",141-151,10.1007/978-3-031-43907-0_14,2023,1,1
1,3D Mitochondria Instance Segmentation with Spa...,"Omkar Thawakar, Rao Muhammad Anwer, Jorma Laak...",613-623,10.1007/978-3-031-43993-3_59,2023,8,2
2,A Spatial-Temporal Deformable Attention Based ...,"Chao Qin, Jiale Cao, Huazhu Fu, Rao Muhammad A...",479-488,10.1007/978-3-031-43895-0_45,2023,2,3
3,A Spatial-Temporally Adaptive PINN Framework f...,"Yubo Ye, Huafeng Liu, Xiajun Jiang, Maryam Tol...",163-172,10.1007/978-3-031-43990-2_16,2023,7,4
4,A Texture Neural Network to Predict the Abnorm...,"Weiguo Cao, Benjamin Howe, Nicholas Rhodes, Su...",470-480,10.1007/978-3-031-43993-3_46,2023,8,5
...,...,...,...,...,...,...,...
184,WeakPolyp: You only Look Bounding Box for Poly...,"Jun Wei, Yiwen Hu, Shuguang Cui, S. Kevin Zhou...",757-766,10.1007/978-3-031-43898-1_72,2023,3,185
185,Weakly-Supervised Positional Contrastive Learn...,"Emma Sarfati, Alexandre Bône, Marc-Michel Rohé...",227-237,10.1007/978-3-031-43907-0_22,2023,1,186
186,X2Vision: 3D CT Reconstruction from Biplanar X...,"Alexandre Cafaro, Quentin Spinat, Amaury Leroy...",699-709,10.1007/978-3-031-43999-5_66,2023,10,187
187,YONA: You Only Need One Adjacent Reference-Fra...,"Yuncheng Jiang, Zixun Zhang, Ruimao Zhang, Gua...",44-54,10.1007/978-3-031-43904-9_5,2023,5,188


In [262]:
#papers = pd.merge(selected_papers_df, selected_papers_path, on='paper_id', how='inner').drop(columns=['title_y']).rename(columns={'title_x': 'title'})
#papers.to_csv(output_path + 'papers.csv', index=False)

Unnamed: 0,title,authors,page_numbers,doi,publication_year,vol_number,paper_id,path
0,3D Arterial Segmentation via Single 2D Project...,"Alina F. Dima, Veronika A. Zimmer, Martin J. M...",141-151,10.1007/978-3-031-43907-0_14,2023,1,1,miccai23vol1/paper_14.pdf
1,3D Mitochondria Instance Segmentation with Spa...,"Omkar Thawakar, Rao Muhammad Anwer, Jorma Laak...",613-623,10.1007/978-3-031-43993-3_59,2023,8,2,miccai23vol8/paper_59.pdf
2,A Spatial-Temporal Deformable Attention Based ...,"Chao Qin, Jiale Cao, Huazhu Fu, Rao Muhammad A...",479-488,10.1007/978-3-031-43895-0_45,2023,2,3,miccai23vol2/paper_46.pdf
3,A Spatial-Temporally Adaptive PINN Framework f...,"Yubo Ye, Huafeng Liu, Xiajun Jiang, Maryam Tol...",163-172,10.1007/978-3-031-43990-2_16,2023,7,4,miccai23vol8/paper_46.pdf
4,A Texture Neural Network to Predict the Abnorm...,"Weiguo Cao, Benjamin Howe, Nicholas Rhodes, Su...",470-480,10.1007/978-3-031-43993-3_46,2023,8,5,miccai23vol6/paper_74.pdf
...,...,...,...,...,...,...,...,...
184,WeakPolyp: You only Look Bounding Box for Poly...,"Jun Wei, Yiwen Hu, Shuguang Cui, S. Kevin Zhou...",757-766,10.1007/978-3-031-43898-1_72,2023,3,185,miccai23vol3/paper_72.pdf
185,Weakly-Supervised Positional Contrastive Learn...,"Emma Sarfati, Alexandre Bône, Marc-Michel Rohé...",227-237,10.1007/978-3-031-43907-0_22,2023,1,186,miccai23vol1/paper_22.pdf
186,X2Vision: 3D CT Reconstruction from Biplanar X...,"Alexandre Cafaro, Quentin Spinat, Amaury Leroy...",699-709,10.1007/978-3-031-43999-5_66,2023,10,187,miccai23vol10/paper_66.pdf
187,YONA: You Only Need One Adjacent Reference-Fra...,"Yuncheng Jiang, Zixun Zhang, Ruimao Zhang, Gua...",44-54,10.1007/978-3-031-43904-9_5,2023,5,188,miccai23vol5/paper_5.pdf
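
In the merged output above, rows 3 and 4 pair a DOI from volume 7 and volume 8 with paths under miccai23vol8 and miccai23vol6. This may indicate that paper_id values assigned after sorting each table separately do not always refer to the same paper. A hedged sketch of a more defensive join is shown below: it merges on a shared key and uses the validate and indicator options of pd.merge to surface mismatches. The two small tables are hypothetical, and joining on title assumes the titles in both tables are complete and identical, which the extracted titles here may not be.

```python
import pandas as pd

# Hypothetical stand-ins for the metadata table and the path table
meta = pd.DataFrame({'title': ['B paper', 'A paper'], 'vol_number': [2, 1]})
paths = pd.DataFrame({
    'title': ['A paper', 'B paper', 'C paper'],
    'path': ['vol1/a.pdf', 'vol2/b.pdf', 'vol3/c.pdf'],
})

# validate raises an error if a title occurs more than once on either side;
# indicator adds a _merge column that records where each row came from
merged = pd.merge(meta, paths, on='title', how='outer',
                  validate='one_to_one', indicator=True)

# Rows that did not find a partner in the other table
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['title', '_merge']])
```

Inspecting the unmatched rows before saving the merged file makes it visible when a key does not line up, instead of silently attaching the wrong path.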


In [263]:
#pd.merge(papers, extracted_sents, on='path', how='inner').to_csv(output_path + 'papers_with_sentences_keywords_metadata.csv', index=False)

In [269]:
filename = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/finalspapers_with_sentences_keywords_metadata.csv'
papers_with_sentences_df = pd.read_csv(filename)
#pd.merge(papers_with_sentences_df, extracted_sents, on='path', how='inner').drop(columns=['paper_id_y', 'extracted_sents_keywords_y']).rename(columns={'paper_id': 'path_long', 'paper_id_x': 'paper_id', 'extracted_sents_keywords_x' : 'extracted_sents_keywords'}).to_csv(output_path + 'papers_with_sentences_metadata.csv', index=False)

In [270]:
def move_col_position_to_first(df, col_name):
    # New column order with the specified column first
    new_columns = [col_name] + [col for col in df.columns if col != col_name]

    # Reindex the DataFrame with the new column order
    return df[new_columns]

def move_col_position_to_last(df, col_name):
    # New column order with the specified column last
    new_columns = [col for col in df.columns if col != col_name] + [col_name]

    # Reindex the DataFrame with the new column order
    return df[new_columns]

In [271]:
filename = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/finalspapers_with_sentences_metadata.csv'
df = pd.read_csv(filename)

In [272]:
df = move_col_position_to_last(df, 'extracted_sents_keywords')
#df.to_csv(output_path + 'papers_with_sentences_metadata.csv', index=False)

In [273]:
df.fillna('None', inplace=True)
df

Unnamed: 0,title,authors,page_numbers,doi,publication_year,vol_number,paper_id,path,path_long,extracted_sents_keywords
0,3D Arterial Segmentation via Single 2D Project...,"Alina F. Dima, Veronika A. Zimmer, Martin J. M...",141-151,10.1007/978-3-031-43907-0_14,2023,1,1,miccai23vol1/paper_14.pdf,/Users/yasminsarkhosh/Documents/GitHub/machine...,the cohort consists of 141 patients with pancr...
1,3D Arterial Segmentation via Single 2D Project...,"Alina F. Dima, Veronika A. Zimmer, Martin J. M...",141-151,10.1007/978-3-031-43907-0_14,2023,1,1,miccai23vol1/paper_14.pdf,/Users/yasminsarkhosh/Documents/GitHub/machine...,the cohort consists of 141 patients with pancr...
2,3D Arterial Segmentation via Single 2D Project...,"Alina F. Dima, Veronika A. Zimmer, Martin J. M...",141-151,10.1007/978-3-031-43907-0_14,2023,1,1,miccai23vol1/paper_14.pdf,/Users/yasminsarkhosh/Documents/GitHub/machine...,we distinguish between models selected accordi...
3,3D Arterial Segmentation via Single 2D Project...,"Alina F. Dima, Veronika A. Zimmer, Martin J. M...",141-151,10.1007/978-3-031-43907-0_14,2023,1,1,miccai23vol1/paper_14.pdf,/Users/yasminsarkhosh/Documents/GitHub/machine...,we distinguish between models selected accordi...
4,3D Mitochondria Instance Segmentation with Spa...,"Omkar Thawakar, Rao Muhammad Anwer, Jorma Laak...",613-623,10.1007/978-3-031-43993-3_59,2023,8,2,miccai23vol8/paper_59.pdf,/Users/yasminsarkhosh/Documents/GitHub/machine...,"during training\nof mitoem, for the fair compa..."
...,...,...,...,...,...,...,...,...,...,...
11458,Weakly-Supervised Positional Contrastive Learn...,"Emma Sarfati, Alexandre Bône, Marc-Michel Rohé...",227-237,10.1007/978-3-031-43907-0_22,2023,1,186,miccai23vol1/paper_22.pdf,/Users/yasminsarkhosh/Documents/GitHub/machine...,the method depth-aware manages to correctly en...
11459,Weakly-Supervised Positional Contrastive Learn...,"Emma Sarfati, Alexandre Bône, Marc-Michel Rohé...",227-237,10.1007/978-3-031-43907-0_22,2023,1,186,miccai23vol1/paper_22.pdf,/Users/yasminsarkhosh/Documents/GitHub/machine...,the method depth-aware manages to correctly en...
11460,X2Vision: 3D CT Reconstruction from Biplanar X...,"Alexandre Cafaro, Quentin Spinat, Amaury Leroy...",699-709,10.1007/978-3-031-43999-5_66,2023,10,187,miccai23vol10/paper_66.pdf,/Users/yasminsarkhosh/Documents/GitHub/machine...,"by grid search on the\nvalidation set, we sele..."
11461,YONA: You Only Need One Adjacent Reference-Fra...,"Yuncheng Jiang, Zixun Zhang, Ruimao Zhang, Gua...",44-54,10.1007/978-3-031-43904-9_5,2023,5,188,miccai23vol5/paper_5.pdf,/Users/yasminsarkhosh/Documents/GitHub/machine...,"for the fairness of the experiments, we keep t..."
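
The output above has an index that runs to 11461, whereas the extracted sentences earlier ended at index 659, and consecutive rows (for example 0 and 1) appear to carry the same paper and the same sentence. This pattern would be consistent with rows being multiplied by repeated merges. The sketch below shows how such repeats could be collapsed, using a small hypothetical table rather than the real df.

```python
import pandas as pd

# Hypothetical table with one repeated (paper, sentence) pair
toy = pd.DataFrame({
    'paper_id': [1, 1, 1, 2],
    'extracted_sents_keywords': ['s1', 's1', 's2', 's3'],
})

# Keep the first occurrence of each (paper, sentence) pair
deduplicated = (
    toy.drop_duplicates(subset=['paper_id', 'extracted_sents_keywords'])
       .reset_index(drop=True)
)

print(len(toy), len(deduplicated))
```

A sentence that legitimately occurs twice in the same paper would also be collapsed by this step, so it is only appropriate if repeated sentences should count once.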


In [194]:
#df_cancer = pd.read_csv('/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/finals/finalspapers_with_sentences_metadata.csv')
#df_cancer.drop(columns='extracted_sents_keywords', inplace=True)
#df_cancer.to_csv(output_path + 'papers_with_sentences_cancer_metadata.csv')

***
***

# **Preliminary analysis of MICCAI 2023 - Selected papers**

***

In [274]:
import pandas as pd
from collections import Counter

# Function to count keywords in a text
def count_keywords(text, keywords):
    # Counter object to count occurrences of each keyword
    counts = Counter()
    lowered = text.lower()
    for keyword in keywords:
        # str.count counts substring occurrences, so 'age' is also counted inside 'image'
        counts[keyword] = lowered.count(keyword)
    return counts

def agg_keywords(df, col_title, keywords):
    # Aggregate 'extracted_sentences' for each 'title' and count keywords
    results = {}
    for title, group in df.groupby('title'):
        # Combine all extracted sentences into one large text block
        aggregated_text = " ".join(group[col_title].tolist())
        # Count the keywords in this aggregated text
        keyword_counts = count_keywords(aggregated_text, keywords)
        # Store the result
        results[title] = keyword_counts

    # Convert the results dictionary to a DataFrame 
    results_df = pd.DataFrame.from_dict(results, orient='index')
    return results_df
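
The sentence extraction earlier matches keywords as whole words, while the counting here uses substring occurrences. The two can disagree: 'age' is found inside 'image' and 'male' inside 'female'. The sketch below contrasts both ways of counting on one made-up sentence. count_keywords_whole_word is a hypothetical variant for illustration, not the function used to produce the results in this notebook.

```python
import re
from collections import Counter

def count_keywords_whole_word(text, keywords):
    # Count each keyword only where it appears as a whole word
    counts = Counter()
    for keyword in keywords:
        pattern = r'\b' + re.escape(keyword) + r'\b'
        counts[keyword] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

sample = "the image shows a female patient of age 54"

# Substring counting, as done by str.count
substring_counts = {k: sample.count(k) for k in ['age', 'male']}
# Whole-word counting with word boundaries
whole_word_counts = count_keywords_whole_word(sample, ['age', 'male'])

print(substring_counts)         # {'age': 2, 'male': 1}
print(dict(whole_word_counts))  # {'age': 1, 'male': 0}
```

Which behaviour is wanted depends on the analysis: substring counts also overlap between keywords such as 'fair', 'unfair' and 'fairness', so whole-word counts are usually easier to interpret per keyword.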


In [322]:
# keywords = ['age', 'gender', 'sex', 'women', 'woman', 'female', 'men', 'man', 'male',
#             'geolocation', 'geographical', 'geographic', 'country', 'countries', 'city', 'cities', 'hospital', 'hospitals', 'clinic', 'clinics', 
#             'society', 'societies',
#             'etnicity', 'etnicities', 'race', 
#             'bias', 'biases', 'fair', 'fairness', 'transparency']

keywords = ['age', 'gender', 'sex', 'women', 'woman', 'female', 'male',
            'geolocation', 'geographical', 'geographic', 'country', 'countries', 'city', 'cities', 'hospital', 'hospitals', 'clinic', 'clinics', 
            'society', 'societies',
            'etnicity', 'etnicities', 'race', 
            'bias', 'biases', 'fair', 'unfair', 'fairness', 'transparency',
            'imbalance', 'imbalanced', 'balance', 'balanced']

In [323]:
agg_keywords_df = agg_keywords(df, 'extracted_sents_keywords', keywords)
save_to_csv(agg_keywords_df, output_path, 'agg_results')

In [324]:
# Reverse the mapping for aggregation
def agg_columns_to_categories(df, keyword_to_category):
    category_to_keywords = {}
    for keyword, category in keyword_to_category.items():
        category_to_keywords.setdefault(category, []).append(keyword)

    # Aggregate columns into categories
    for category, keywords in category_to_keywords.items():
        if category in df.columns:
            # If the category already exists, add to it
            df[category] += df[keywords].sum(axis=1)
        else:
            # Otherwise, create a new column for the category
            df[category] = df[keywords].sum(axis=1)
        # Drop the original keyword columns
        df.drop(columns=keywords, inplace=True)

    return df

In [325]:
# Mapping of keywords to main categories
keyword_to_category = {
    'age'   : 'age_',
    'gender': 'gender_',
    'sex'   : 'gender_',
    'female': 'gender_',
    'women' : 'gender_',
    'woman' : 'gender_',
    'male'  : 'gender_',
    'geolocation'   : 'geolocation_',
    'geographical'  : 'geolocation_',
    'geographic'    : 'geolocation_',
    'country'       : 'geolocation_',
    'countries'     : 'geolocation_',
    'city'          : 'geolocation_',
    'cities'        : 'geolocation_',
    'hospital'      : 'geolocation_',
    'hospitals'     : 'geolocation_',
    'clinic'        : 'geolocation_',
    'clinics'       : 'geolocation_',
    'society'       : 'social factors',
    'societies'     : 'social factors',
    'etnicity'      : 'etnicity_',
    'etnicities'    : 'etnicity_',
    'race'          : 'etnicity_',
    'bias'          : 'bias_',
    'biases'        : 'bias_',
    'unfair'        : 'fairness_',
    'fair'          : 'fairness_',
    'fairness'      : 'fairness_',
    'transparency'  : 'fairness_',
    'imbalance'     : 'fairness_',
    'imbalanced'    : 'fairness_',
    'balance'       : 'fairness_',
    'balanced'      : 'fairness_',
}

In [326]:
res = agg_columns_to_categories(agg_keywords_df, keyword_to_category)
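
agg_columns_to_categories modifies the dataframe it receives, so agg_keywords_df and res refer to the same object after this call. Below is a sketch of an alternative that leaves the input untouched: it renames keyword columns to their category and sums columns that share a name. The toy tables are hypothetical, and the result has its columns in alphabetical order rather than in mapping order.

```python
import pandas as pd

# Hypothetical keyword counts per paper
toy_counts = pd.DataFrame(
    {'gender': [1, 0], 'sex': [2, 0], 'bias': [0, 4]},
    index=['Paper A', 'Paper B'],
)
toy_mapping = {'gender': 'gender_', 'sex': 'gender_', 'bias': 'bias_'}

# Rename keyword columns to categories, then sum the columns with equal names.
# Transposing lets groupby work on the (former) column labels.
category_counts = (
    toy_counts.rename(columns=toy_mapping)
              .T.groupby(level=0).sum()
              .T
)

print(category_counts)
```

Since rename returns a new dataframe, toy_counts keeps its original keyword columns and can still be saved or inspected afterwards.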

In [327]:
save_to_csv(res, output_path, 'agg_counts')

In [328]:
# Convert counts to binary values
def convert_to_binary_values(df):
    columns_to_convert = df.columns.tolist()

    # Convert to binary: 1 if the count is greater than 0, else 0
    for column in columns_to_convert:
        df[column] = df[column].apply(lambda x: 1 if x > 0 else 0)
    
    return df
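
The same conversion can be written as a single vectorized comparison, which also avoids changing the input dataframe. This is a minimal sketch on a hypothetical two-paper table, not a replacement that has been run on the notebook's data.

```python
import pandas as pd

# Hypothetical category counts per paper
counts = pd.DataFrame(
    {'age_': [0, 3], 'bias_': [1, 0]},
    index=['Paper A', 'Paper B'],
)

# True where the count is positive, then cast booleans to 0/1
binary = (counts > 0).astype(int)

print(binary)
```

Because the comparison returns a new dataframe, the original counts stay available, for example for saving the raw counts after the binary table has been built.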

In [329]:
binary_df = convert_to_binary_values(res)
save_to_csv(binary_df, output_path, 'agg_columns_binary_values')

In [330]:
filename = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/finalsagg_columns_binary_values.csv'

agg_binary_df = pd.read_csv(filename)
agg_binary_df

Unnamed: 0.1,Unnamed: 0,age_,gender_,geolocation_,social factors,etnicity_,bias_,fairness_
0,3D Arterial Segmentation via Single 2D Project...,0,1,0,0,0,0,1
1,3D Mitochondria Instance Segmentation with Spa...,0,0,0,0,0,0,1
2,A Spatial-Temporal Deformable Attention Based ...,0,0,0,0,0,0,1
3,A Spatial-Temporally Adaptive PINN Framework f...,1,1,1,0,0,0,1
4,A Texture Neural Network to Predict the Abnorm...,1,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...
184,WeakPolyp: You only Look Bounding Box for Poly...,0,0,0,0,0,1,0
185,Weakly-Supervised Positional Contrastive Learn...,1,0,1,0,0,0,1
186,X2Vision: 3D CT Reconstruction from Biplanar X...,0,0,0,0,0,0,1
187,YONA: You Only Need One Adjacent Reference-Fra...,0,0,0,0,0,0,1
