#### **Objective of the Notebook**
***
The notebook extracts keyword-matching sentences from preprocessed MICCAI 2023 research papers and groups them by paper title. The keywords target text likely to report patient demographics (age, gender, ethnicity), geographic location, dataset provenance, and bias.

#### Input Data Expected
- **CSV File**: A CSV file of preprocessed text from research papers, with a `title` column for the paper title and a `text` column for the text content.

#### Output Data/Files Generated
- **CSV Files**: One CSV file per keyword list (age, gender, ethnicity, location, bias, patients, dataset information), containing either the matched keywords or the extracted sentences for each paper. These outputs are intended for further analysis or use in machine learning tasks.
- **Directories**: Keyword matches are written to `04MICCAI_notebook_extracted_keywords` and extracted sentences to `04MICCAI_notebook_extracted_sentences`. Each directory is created if it does not already exist.

#### Assumptions or Important Notes
- **Preprocessing Requirement**: This notebook assumes that all necessary data cleaning, formatting, and preliminary analysis have been completed prior to sentence extraction.
- **Keyword Relevance**: The success of the extraction process is highly dependent on the relevance and completeness of the keywords list used. Inaccurate or incomplete keyword lists may result in incomplete or biased data collections.
- **Text Structure**: The notebook splits text into sentences at a period, question mark, or exclamation mark followed by whitespace. Text that departs from this pattern (abbreviations such as "et al.", or missing spaces after punctuation) will be segmented incorrectly, which reduces extraction accuracy.
- **Error Handling**: File reading and writing have minimal error handling, so a missing input file or an incorrect path will raise an exception. Verify that all file paths are correct and accessible before running the notebook.
- **Scalability**: The computational intensity of the process may vary based on the amount of data being processed (number of papers and volume of text). Optimization may be required to handle large datasets effectively.
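
The text-structure assumption can be checked on a small example. The sketch below applies the same splitting rule the notebook uses later (`(?<=[.?!])\s+`) to a made-up passage; the sample text is purely illustrative and does not come from the dataset.

```python
import re

# Same rule as the notebook: split after '.', '?' or '!' when whitespace follows.
SENTENCE_SPLIT = re.compile(r'(?<=[.?!])\s+')

# Illustrative text only (not taken from any MICCAI paper).
sample = (
    "We included 120 patients (mean age 54). "
    "Data were collected by Smith et al. at two hospitals. "
    "Was sex reported? No."
)

sentences = SENTENCE_SPLIT.split(sample)
for sentence in sentences:
    print(repr(sentence))
```

The abbreviation "et al." ends in a period, so the second sentence is cut in two. A keyword appearing after such an abbreviation would be returned with only part of its surrounding sentence.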


***

#### **Input and Output Files**

| Type   | Description                     | Details                                                                                  |
|--------|---------------------------------|------------------------------------------------------------------------------------------|
| Input  | **CSV File**                    | Preprocessed text from research papers, with columns for paper titles and text content.  |
| Output | **CSV Files for Each Category** | Extracted sentences or keywords for specific topics.                                     |
| Output | **Directories**                 | Organized storage for extracted data by category.                                        |



## **Keyword-Based Sentence Extraction from Selected MICCAI 2023 Research Articles**
***

In [1]:
# Standard library imports
import os
import re
from collections import Counter

# Third-party library imports
import pandas as pd
import numpy as np

In [7]:
filename = '00MICCAI_total_outputs/03MICCAI_all_outputs/03MICCAI_notebook_df_paper_extractions_patients_and_cancer.csv'

#03MICCAI_all_outputs/03MICCAI_notebook_df_paper_extractions_patients_and_cancer.csv
selected_papers = pd.read_csv(filename)
len(selected_papers['title'].unique()) # 155 unique papers

155
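
Because the load step has no error handling, a small guard can make failures easier to diagnose. The following is a minimal sketch, not part of the original notebook: `load_papers` is a hypothetical helper, and the required column names are assumed from how `title` and `text` are used in later cells.

```python
import os
import pandas as pd

def load_papers(path, required_columns=('title', 'text')):
    """Hypothetical helper: read the CSV and fail early with a clear message."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f'Input CSV not found: {path}')
    df = pd.read_csv(path)
    # Check that the columns used by the extraction functions are present.
    missing = [column for column in required_columns if column not in df.columns]
    if missing:
        raise ValueError(f'Input CSV is missing columns: {missing}')
    return df
```

With this helper, `selected_papers = load_papers(filename)` would replace the direct `pd.read_csv(filename)` call.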

### Extract keyword-related sentences from selected papers
***

##### Short keywords list

In [8]:
# List of keywords
keywords_age        = ['age', 'young', 'old']

keywords_gender     = ['gender', 'sex', 'women', 'woman', 'female', 'male']

keywords_etnicity   = ['ethnicity', 'ethnicities', 'race', 'white patients', 'black patients']

keywords_geoloc     = ['geolocation', 'geographical', 'geographic', 'country', 'countries', 'city', 'cities', 
                       'hospital', 'hospitals', 'clinic', 'clinics', 'continent','province', 'state', 'region', 
                       'town', 'village', 'area', 'district']

keywords_bias       = ['bias', 'biases', 'fairness']

keywords_patients   = ['patient', 'patients']

keywords_data       = ['dataset', 'datasets', 'data set', 'data sets', 'publicly', 'public', 'private', 'open access', 'open-access']


In [9]:
categories = {
    'age'           : ['age', 'young', 'old'],
    'gender'        : ['gender', 'sex', 'women', 'woman', 'female', 'male'],
    'ethnicity'     : ['ethnicity', 'ethnicities', 'race', 'white patients', 'black patients'],
    'location_info' : ['geolocation', 'geographical', 'geographic', 'country', 'countries', 
                       'city', 'cities', 'hospital', 'hospitals', 'clinic', 'clinics', 'continent',
                       'province', 'state', 'region', 'town', 'village', 'area', 'district'],
    'dataset_info'  : ['dataset', 'datasets', 'data set', 'data sets', 'publicly', 'public', 'private', 'open access', 'open-access'],
    'bias_info'     : ['bias', 'biases', 'fairness']
}
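
A mapping like `categories` lends itself to a single loop over category names instead of one variable per keyword list. The sketch below shows one possible way to do that; `build_pattern` is a hypothetical helper, the mapping is abbreviated, and the test sentence is invented for illustration.

```python
import re

# Abbreviated copy of the notebook's `categories` mapping, for illustration.
categories = {
    'age': ['age', 'young', 'old'],
    'gender': ['gender', 'sex', 'women', 'woman', 'female', 'male'],
    'bias_info': ['bias', 'biases', 'fairness'],
}

def build_pattern(keywords):
    """Hypothetical helper: compile a case-insensitive whole-word pattern."""
    alternatives = '|'.join(re.escape(keyword) for keyword in keywords)
    return re.compile(r'\b(?:' + alternatives + r')\b', flags=re.IGNORECASE)

# One compiled pattern per category name.
patterns = {name: build_pattern(words) for name, words in categories.items()}

# Invented example sentence.
sentence = "The cohort included 40 female patients with a median age of 61."
matched = [name for name, pattern in patterns.items() if pattern.search(sentence)]
print(matched)  # ['age', 'gender']
```

Keying the patterns by category name also gives each output a short, stable label (for example `age` or `gender`) that could be used in file names.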

***
#### Split the text into words and extract keyword matches. Group each keyword match by related paper
***

In [10]:
def wrap_text(text, width=100):
    """
    A simple function to wrap text at a given width.
    """
    if pd.isnull(text):
        return text  # Handle NaN values
    
    wrapped_lines = []
    for paragraph in text.split('\n'):  # Splitting by existing newlines to preserve paragraph breaks
        line = ''
        for word in paragraph.split():
            if len(line) + len(word) + 1 > width:
                wrapped_lines.append(line)
                line = word
            else:
                line += (' ' + word if line else word)
        wrapped_lines.append(line)
    return '\n'.join(wrapped_lines)
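
For comparison, Python's standard `textwrap` module offers similar wrapping. The sketch below is a hypothetical alternative (`wrap_text_stdlib` is not part of the notebook) and may differ from `wrap_text` in edge cases: for instance, `textwrap.fill` breaks words longer than `width` by default, which the hand-written loop does not.

```python
import textwrap

def wrap_text_stdlib(text, width=100):
    """Hypothetical alternative to wrap_text built on textwrap.fill."""
    if not isinstance(text, str):
        return text  # leave NaN/None values untouched
    # Wrap each paragraph separately so existing newlines are preserved.
    return '\n'.join(
        textwrap.fill(paragraph, width=width) for paragraph in text.split('\n')
    )

wrapped = wrap_text_stdlib("word " * 50, width=20)
print(wrapped)
```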

### Keywords search only
***

In [11]:
# Search each paper's text for whole-word keyword matches

def extract_keywords(df, keywords):
    # Match any keyword as a whole word; re.escape keeps regex metacharacters literal
    pattern = r'\b(' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b'

    # Initialize a dictionary to hold keyword matches organized by paper title
    matches_by_paper = {}

    # Loop through each row in the dataframe
    for index, row in df.iterrows():
        # Find every keyword occurrence in the text
        # (findall returns the matched keywords themselves, not the surrounding sentences)
        matches = re.findall(pattern, row['text'], flags=re.IGNORECASE)

        # If there are matches, add them to the dictionary under the paper title
        if matches:
            paper_title = row['title']
            if paper_title not in matches_by_paper:
                matches_by_paper[paper_title] = []
            matches_by_paper[paper_title].extend(matches)

    # Convert the dictionary into a DataFrame with one (paper title, matched keyword) row per match
    keywords_data = [(title, match) for title, related_group in matches_by_paper.items() for match in related_group]
    keywords_df = pd.DataFrame(keywords_data, columns=['title', 'keyword'])
    keywords_df['keyword'] = keywords_df['keyword'].apply(lambda x: wrap_text(x, width=100))

    # Create the output directory if it does not exist
    output_dir = '04MICCAI_notebook_extracted_keywords'
    os.makedirs(output_dir, exist_ok=True)

    # Save the DataFrame to a CSV file in the output directory
    # (the printed form of the keyword list becomes part of the file name)
    keywords_df.to_csv(f'{output_dir}/keywords_{keywords}.csv', index=False)

    return keywords_df       
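
Since `extract_keywords` collects the matched keywords rather than full sentences, its natural use is counting how often each keyword occurs. The notebook imports `Counter` but does not show this step, so the following is only a sketch of how such a tally might look, using an invented example text.

```python
import re
from collections import Counter

keywords = ['age', 'young', 'old']  # same values as keywords_age
pattern = re.compile(
    r'\b(' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b',
    flags=re.IGNORECASE,
)

# Invented example text; note that 'image' contains 'age' but is not a whole-word match.
text = (
    "Age was recorded for all patients. Young and old age groups were "
    "compared; image quality was not affected."
)

# Lower-case the matches so 'Age' and 'age' are counted together.
counts = Counter(match.lower() for match in pattern.findall(text))
print(counts.most_common())  # [('age', 2), ('young', 1), ('old', 1)]
```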

In [12]:
# Extract the keywords from the selected papers and save them to CSV files in the output directory
keywords_age_df = extract_keywords(selected_papers, keywords_age)
keywords_gender_df = extract_keywords(selected_papers, keywords_gender)
keywords_etnicity_df = extract_keywords(selected_papers, keywords_etnicity)
keywords_geoloc_df = extract_keywords(selected_papers, keywords_geoloc)
keywords_bias_df = extract_keywords(selected_papers, keywords_bias)
keywords_patients_df = extract_keywords(selected_papers, keywords_patients)
keywords_data_df = extract_keywords(selected_papers, keywords_data)

***
### Sentence search only by list of keywords
***

In [9]:
# Pseudo code
# regex to split the text into sentences. A sentence is defined as a sequence of characters that ends with a period, question mark, or exclamation mark.
# iterate through the sentences to find those with a keyword from the list of keywords. 
# for each match
    # option 1) concatenate the previous and next sentences to the sentence with the keyword (if they haven't been added already)
    # option 2) extract sentence with keyword only
# keep track of the sentences already added for each paper title.
# if no matches are found for a paper title, add 'none'.
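
Option 1 of the pseudo code is not implemented in this notebook. A minimal sketch of how it might look is given below, under the assumption that context means exactly one sentence before and one after each match. `sentences_with_context` is a hypothetical name, and the function returns the selected sentences as an ordered list rather than one concatenated string.

```python
import re

def sentences_with_context(text, keywords):
    """Return keyword sentences plus their immediate neighbours, without repeats."""
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b',
        flags=re.IGNORECASE,
    )
    sentences = re.split(r'(?<=[.?!])\s+', text)

    # Indices of sentences to keep; a set prevents adding a sentence twice.
    keep = set()
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            # Mark the previous, current, and next sentence where they exist.
            keep.update(j for j in (i - 1, i, i + 1) if 0 <= j < len(sentences))

    selected = [sentences[i] for i in sorted(keep)]
    return selected if selected else ['none']

# Invented example text.
example = "Intro here. Mean age was 50. Scans were 3T. Nothing more. The end."
print(sentences_with_context(example, ['age']))
```

Tracking indices rather than sentence strings keeps the original order and avoids duplicates when two neighbouring sentences both contain a keyword.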

***



In [13]:
# EXTRACT SENTS WITH KEYWORDS    
# Option 2) Storing keyword sentence only 

def extract_keyword_sentences(df, keywords):
    """
    Extract sentences containing specified keywords from DataFrame and organize by paper title.

    Parameters:
    - df: DataFrame containing the text to search through.
    - keywords: List of keywords to search for in the text.

    Returns:
    - A DataFrame with one row per (paper title, extracted sentence). Papers without a match get a single 'none' row.
      The DataFrame is also saved as a CSV file in the output directory.
    """

    # Compile a whole-word, case-insensitive pattern; re.escape keeps regex metacharacters in keywords literal
    keyword_pattern = re.compile(r'\b(?:' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b', flags=re.IGNORECASE)

    # Initialize a dictionary to hold sentences organized by paper title
    sentences_by_paper = {}

    # Loop through each paper title in the DataFrame
    for title in df['title'].unique():
        # Get the full text for the current title
        text = ' '.join(df[df['title'] == title]['text'])
        # Split the text into sentences
        sentences = re.split(r'(?<=[.?!])\s+', text)

        # List to store sentences that contain the keyword
        keyword_sentences_buffer = []

        # Iterate through sentences to find and store sentences that contain the keyword
        for sentence in sentences:
            if keyword_pattern.search(sentence):
                # Add only the sentence with the keyword to the buffer
                keyword_sentences_buffer.append(sentence)

        # Add the sentences to the dictionary, use 'none' if there are no matches
        sentences_by_paper[title] = keyword_sentences_buffer if keyword_sentences_buffer else ['none']
    
    extracted_data = [(title, keyword_sentence) for title, related_group in sentences_by_paper.items() for keyword_sentence in related_group]
    extracted_df = pd.DataFrame(extracted_data, columns=['title', 'extracted_keyword_sent'])
    
    # Wrap the extracted sentences to a maximum width of 80 characters for better readability
    extracted_df['extracted_keyword_sent'] = extracted_df['extracted_keyword_sent'].apply(wrap_text, width=80)

    output_dir = '04MICCAI_notebook_extracted_sentences'
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    extracted_df.to_csv(f'{output_dir}/extracted_sentences_{keywords}.csv', index=False)

    return extracted_df
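
Both extraction functions place `{keywords}` directly in the output path, so the file name contains the printed form of the list, for example `extracted_sentences_['age', 'young', 'old'].csv`. If shorter names are preferred, one option is to derive a label from the keywords. The helper below is a hypothetical sketch and is not used by the notebook.

```python
import re

def keywords_slug(keywords, max_length=60):
    """Hypothetical helper: turn a keyword list into a short, filesystem-friendly label."""
    joined = '_'.join(keywords).lower()
    # Replace every run of characters other than a-z and 0-9 with one underscore.
    slug = re.sub(r'[^a-z0-9]+', '_', joined).strip('_')
    return slug[:max_length].rstrip('_')

print(keywords_slug(['age', 'young', 'old']))       # age_young_old
print(keywords_slug(['data set', 'open-access']))   # data_set_open_access
```

Changing the file names would affect any later step that reads these CSV files by name, so such a change would need to be applied consistently.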

In [14]:
# Extract sentences containing keywords and save the results to a CSV file in the output directory
age_extracted_sents_df = extract_keyword_sentences(selected_papers, keywords_age) 
gender_extracted_sents_df = extract_keyword_sentences(selected_papers, keywords_gender)
ethnicity_extracted_sents_df = extract_keyword_sentences(selected_papers, keywords_etnicity)
location_extracted_sents_df = extract_keyword_sentences(selected_papers, keywords_geoloc)
bias_extracted_sents_df = extract_keyword_sentences(selected_papers, keywords_bias)
patients_extracted_sents_df = extract_keyword_sentences(selected_papers, keywords_patients)
data_extracted_sents_df = extract_keyword_sentences(selected_papers, keywords_data)

***
***
***