### Objective of the Notebook
The notebook extracts sentences from preprocessed MICCAI 2023 research papers by matching them against lists of keywords. The extracted sentences are grouped by paper title and target text that is likely to report demographics (age, gender, ethnicity, geographic location) and related categories such as bias, fairness, patients, and dataset collection.

### Input Data Expected
- **CSV File**: The input is a CSV file containing preprocessed text from research papers. It is expected to have a `title` column and a `text` column, both produced in a previous step (as indicated by a linked notebook).

### Output Data/Files Generated
- **CSV Files**: For each category of interest (like age, gender, ethnicity, dataset info), the notebook will generate CSV files containing extracted sentences or keywords. These files will help in further analysis or machine learning tasks focused on these categories.
- **Directories**: The extracted data will be saved into directories specified for cancer-related content, patient-related content, etc., facilitating organized access to this information for further use.

### Assumptions or Important Notes
- **Preprocessing Required**: It assumes that the data has been preprocessed for extraction, meaning any necessary cleaning, formatting, or preliminary analysis has been completed beforehand.
- **Keyword Relevance**: The effectiveness of the extraction process heavily depends on the relevance and comprehensiveness of the keyword list used for extraction. Misclassification or omission of relevant keywords might lead to incomplete or skewed data analysis.
- **Text Structure**: The notebook assumes that the text in the input CSV is well-structured and correctly segmented into sentences. Any irregularities in text structuring might affect the accuracy of sentence extraction.
- **Error Handling**: The notebook does little error handling for file reading and writing, so a missing input file or a nonexistent output directory will raise an exception. Make sure the input paths are correct and accessible before running.
- **Scalability**: Depending on the volume of data (number of papers and the length of text in them), the process might be computationally intensive, requiring optimization for handling large datasets efficiently.
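
The error-handling note above could be addressed with two small helpers. This is a minimal sketch, not part of the original notebook: `load_papers` and `save_results` are hypothetical names, and the required column names `title` and `text` are taken from how the notebook accesses the DataFrame.

```python
from pathlib import Path

import pandas as pd


def load_papers(csv_path, required_columns=("title", "text")):
    """Read the preprocessed CSV and fail early with a clear message."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Input CSV not found: {path}")
    df = pd.read_csv(path)
    # Check that the columns used later in the notebook are present
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Input CSV is missing columns: {missing}")
    return df


def save_results(df, out_path):
    """Create the output directory if needed, then write the CSV."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)
    return out_path
```

Failing early keeps a wrong path from surfacing later as a less readable pandas error.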

# Keyword-Based Sentence Extraction from Selected MICCAI 2023 Research Articles
***

In [2]:
import os
import re
import pandas as pd
import numpy as np
from collections import Counter

In [9]:
filename = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/project_submission/00_project_notebook/submission notebooks/03MICCAI_notebook_papers_with_patients.csv'
selected_papers = pd.read_csv(filename)
len(selected_papers['title'].unique())

155

### Extract keyword-related sentences from selected papers
***

##### Short keywords list

In [6]:
# List of keywords
keywords_age        = ['age']

keywords_gender     = ['gender', 'sex', 'women', 'woman', 'female', 'male']

keywords_etnicity   = ['ethnicity', 'ethnicities', 'race', 'white patients', 'black patients']

keywords_geoloc     = ['geolocation', 'geographical', 'geographic', 'country', 'countries', 'city', 'cities', 
                        'hospital', 'hospitals', 'clinic', 'clinics']

keywords_bias       = ['bias', 'biases']

keywords_fairness   = ['fairness']

keywords_patients   = ['patient', 'patients']

keywords_data       = ['dataset', 'datasets', 'data collection', 'data collections']

keywords_collected  = ['collected']

***
#### Split the text into words and extract keyword-matches. Group each keyword-match by related paper
***

In [10]:
def wrap_text(text, width=100):
    """
    A simple function to wrap text at a given width.
    """
    if pd.isnull(text):
        return text  # Handle NaN values
    
    wrapped_lines = []
    for paragraph in text.split('\n'):  # Splitting by existing newlines to preserve paragraph breaks
        line = ''
        for word in paragraph.split():
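            # Only break when the current line is non-empty, so an overlong first word does not emit an empty line
            if line and len(line) + len(word) + 1 > width: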
                wrapped_lines.append(line)
                line = word
            else:
                line += (' ' + word if line else word)
        wrapped_lines.append(line)
    return '\n'.join(wrapped_lines)
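
The standard library's `textwrap` module offers similar wrapping. The following is a sketch of an alternative, not the notebook's own function: `wrap_text_stdlib` is a hypothetical name, and the `break_long_words=False` and `break_on_hyphens=False` settings are chosen here to stay close to the word-by-word behaviour of `wrap_text`.

```python
import textwrap


def wrap_text_stdlib(text, width=100):
    """Wrap each paragraph with textwrap.fill, keeping existing newlines."""
    if not isinstance(text, str):
        return text  # Pass through NaN/None values unchanged
    return '\n'.join(
        textwrap.fill(paragraph, width=width,
                      break_long_words=False, break_on_hyphens=False)
        for paragraph in text.split('\n')
    )


example = "The cohort included 120 patients aged 45 to 70 years.\nSecond paragraph."
print(wrap_text_stdlib(example, width=30))
```

Note that `textwrap.fill` also normalizes whitespace inside each paragraph, which matches what `paragraph.split()` does in `wrap_text`.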

### Keywords search only
***

In [11]:
# Search each paper's text for whole-word keyword matches

def extract_keywords(df, keywords):
    # Match any keyword as a whole word (word boundaries on both sides)
    pattern = r'\b(' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b'

    # Initialize a dictionary to hold keyword matches organized by paper title
    matches_by_paper = {}

    # Loop through each row of the DataFrame passed in as `df`
    for index, row in df.iterrows():
        # Find every keyword occurrence in the text (findall returns the matched keywords, not sentences)
        matches = re.findall(pattern, row['text'], flags=re.IGNORECASE)

        # If there are matches, add them to the dictionary under the paper title
        if matches:
            paper_title = row['title']
            if paper_title not in matches_by_paper:
                matches_by_paper[paper_title] = []
            matches_by_paper[paper_title].extend(matches)

    # Convert the dictionary into a DataFrame with one row per (paper title, matched keyword)
    keywords_data = [(title, keyword) for title, related_group in matches_by_paper.items() for keyword in related_group]
    keywords_df = pd.DataFrame(keywords_data, columns=['title', 'keyword'])

    return keywords_df
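
The whole-word matching that `extract_keywords` relies on can be seen in isolation. This is a small illustration on a made-up sentence; the two keywords are a subset of the lists defined earlier.

```python
import re

keywords = ["age", "male"]  # illustrative subset of the keyword lists
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
    flags=re.IGNORECASE,
)

text = "Image quality varied with Age; 12 female and 9 male patients were imaged."
matches = pattern.findall(text)
print(matches)  # ['Age', 'male']
```

The `\b` boundaries prevent hits inside longer words such as "Image" or "female". The flip side is that inflected forms like "ages" or "aged" are not matched unless they are listed as keywords themselves.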

In [12]:
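# Every call below assigns to the same variable, so only the last result (bias) remains in keywords_df
keywords_df = extract_keywords(selected_papers, keywords_age)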
keywords_df = extract_keywords(selected_papers, keywords_gender)
keywords_df = extract_keywords(selected_papers, keywords_etnicity)
keywords_df = extract_keywords(selected_papers, keywords_geoloc)
keywords_df = extract_keywords(selected_papers, keywords_patients)
keywords_df = extract_keywords(selected_papers, keywords_bias)
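
Because the calls above assign every result to the same `keywords_df` variable, only the last category survives. One way to keep all of them is a dictionary keyed by category. This is a sketch under assumptions: `find_keywords` is a hypothetical, compact stand-in for `extract_keywords`, and the two papers are made-up data.

```python
import re

import pandas as pd


def find_keywords(df, keywords):
    """Return one row per (title, matched keyword); hypothetical helper."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    hits = df["text"].fillna("").str.findall(pattern, flags=re.IGNORECASE)
    out = pd.DataFrame({"title": df["title"], "keyword": hits}).explode("keyword")
    # Papers without any match produce NaN after explode; drop them
    return out.dropna(subset=["keyword"]).reset_index(drop=True)


keyword_groups = {"age": ["age"], "bias": ["bias", "biases"]}
papers = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "text": ["Mean age was 60. Age bias is possible.", "No demographics reported."],
})

# One DataFrame per category instead of a single overwritten variable
results = {name: find_keywords(papers, kws) for name, kws in keyword_groups.items()}
print(results["age"])
```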


In [8]:
#keywords_df.to_csv("/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/outputs/data/bias_related_keywords_.csv", index=False)

***
### Sentence search only by list of keywords
***

In [9]:
# Pseudo code
# regex to split the text into sentences. A sentence is defined as a sequence of characters that ends with a period, question mark, or exclamation mark.
# iterate through the sentences to find those with a keyword from the list of keywords. 
# for each match
    # option 1) concatenate the previous and next sentences to the sentence with the keyword (if they haven't been added already)
    # option 2) extract sentence with keyword only
# keep track of the sentences already added for each paper title.
# if no matches are found for a paper title, add 'none'.

***
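
Option 1 from the pseudocode is not implemented in this notebook. A minimal sketch of one possible reading follows: it tracks sentence indices so that a neighbouring sentence shared by two matches is added only once, and it returns a flat list instead of concatenated strings. The function name `extract_with_context` is hypothetical.

```python
import re


def extract_with_context(text, keywords):
    """Return keyword sentences plus their neighbours, without duplicates."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        flags=re.IGNORECASE,
    )
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    keep = set()  # indices of sentences already selected
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            # Previous, current, and next sentence, clipped to valid indices
            keep.update(j for j in (i - 1, i, i + 1) if 0 <= j < len(sentences))
    return [sentences[i] for i in sorted(keep)] or ["none"]


sample = ("We used CT scans. Patients were 60 years of age. "
          "Scans were resampled. Training took two days. Results improved.")
print(extract_with_context(sample, ["age"]))
```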



In [13]:
# EXTRACT SENTS WITH KEYWORDS    
# Option 2) Storing keyword sentence only 

def extract_keyword_sentences(df, keywords):
    """
    Extract sentences containing specified keywords from DataFrame and organize by paper title.

    Parameters:
    - df: DataFrame containing the text to search through.
    - keywords: List of keywords to search for in the text.

    Returns:
    - A DataFrame with one row per (paper title, matching sentence); papers without a match get a single 'none' row.
    """

    # Compile the regular expression for whole-word keyword matches (keywords are escaped so they are treated literally)
    keyword_pattern = re.compile(r'\b(?:' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b', flags=re.IGNORECASE)

    # Initialize a dictionary to hold sentences organized by paper title
    sentences_by_paper = {}

    # Loop through each paper title in the DataFrame
    for title in df['title'].unique():
        # Get the full text for the current title
        text = ' '.join(df[df['title'] == title]['text'])
        # Split the text into sentences
        sentences = re.split(r'(?<=[.?!])\s+', text)

        # List to store sentences that contain the keyword
        keyword_sentences_buffer = []

        # Iterate through sentences to find and store sentences that contain the keyword
        for sentence in sentences:
            if keyword_pattern.search(sentence):
                # Add only the sentence with the keyword to the buffer
                keyword_sentences_buffer.append(sentence)

        # Add the sentences to the dictionary, use 'none' if there are no matches
        sentences_by_paper[title] = keyword_sentences_buffer if keyword_sentences_buffer else ['none']
    
    extracted_data = [(title, keyword_sentence) for title, related_group in sentences_by_paper.items() for keyword_sentence in related_group]
    extracted_df = pd.DataFrame(extracted_data, columns=['title', 'extracted_keyword_sent'])
    
    # Wrap the extracted sentences to a maximum width of 80 characters for better readability
    extracted_df['extracted_keyword_sent'] = extracted_df['extracted_keyword_sent'].apply(wrap_text, width=80)

    return extracted_df
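
The sentence splitter used in `extract_keyword_sentences` breaks after every period, question mark, or exclamation mark, which also cuts at abbreviations common in papers. The sketch below shows the effect on a made-up sentence and one possible mitigation with fixed-width negative lookbehinds. The abbreviation list (`et al.`, `Fig.`) is an assumption for illustration, not something taken from the notebook.

```python
import re

text = "Data were collected by Smith et al. in 2019. See Fig. 2 for age groups."

# Splitter as used in the notebook: break after ., ? or ! followed by whitespace
naive = re.split(r"(?<=[.?!])\s+", text)
print(naive)
# ['Data were collected by Smith et al.', 'in 2019.', 'See Fig.', '2 for age groups.']

# Variant that refuses to split right after a few known abbreviations
guarded = re.split(r"(?<!et al\.)(?<!Fig\.)(?<=[.?!])\s+", text)
print(guarded)
# ['Data were collected by Smith et al. in 2019.', 'See Fig. 2 for age groups.']
```

With the naive splitter, a keyword and its numeric value can land in different fragments (here "Fig." and "2 for age groups."), so extracted sentences may be incomplete.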

In [19]:
# extracted_data = extract_keyword_sentences(df_cancer_related, keywords_age).to_csv('age_related_sentences.csv')
# extracted_data = extract_keyword_sentences(df_cancer_related, keywords_gender).to_csv('gender_related_sentences.csv')
# extracted_data = extract_keyword_sentences(df_cancer_related, keywords_etnicity).to_csv('etnicity_related_sentences.csv')
# extracted_data = extract_keyword_sentences(df_cancer_related, keywords_geoloc).to_csv('geoloc_related_sentences.csv')
# extracted_data = extract_keyword_sentences(df_cancer_related, keywords_patients).to_csv('patient_related_sentences.csv')
# extracted_data = extract_keyword_sentences(df_cancer_related, keywords_bias).to_csv('bias_related_sentences.csv')


***

In [30]:
anno_demo = pd.read_csv('/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/outputs/databases/annotations/anno_demographics.csv', delimiter=';')

In [34]:
anno_demo = anno_demo.fillna(0)

In [38]:
anno_demo['volume'] = anno_demo['volume'].astype(int)
anno_demo['etnicity is used'] = anno_demo['etnicity is used'].astype(int)

In [40]:
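# Write with the same ';' separator the file is read with, and without the index column
anno_demo.to_csv('/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/outputs/databases/annotations/anno_demographics.csv', sep=';', index=False)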
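
Since the annotation file is read with `delimiter=';'` and then written back to the same path, the separator used for writing matters when the cells are re-run. The following sketch uses an in-memory buffer and made-up values to compare a matching round trip with a default `to_csv` call. Only the column names `volume` and `etnicity is used` come from the notebook.

```python
import io

import pandas as pd

anno = pd.DataFrame({
    "title": ["Paper A"],
    "volume": [1.0],
    "etnicity is used": [float("nan")],
})
anno = anno.fillna(0)
anno["volume"] = anno["volume"].astype(int)
anno["etnicity is used"] = anno["etnicity is used"].astype(int)

# Round trip with a consistent ';' separator and no index column
buffer = io.StringIO()
anno.to_csv(buffer, sep=";", index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, delimiter=";")
print(list(reloaded.columns))

# Default to_csv (comma-separated, index included) read back with ';'
buffer_default = io.StringIO()
anno.to_csv(buffer_default)
buffer_default.seek(0)
broken = pd.read_csv(buffer_default, delimiter=";")
print(broken.shape)  # everything collapses into a single column
```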