### Objective of the Notebook
This notebook extracts sentences that match a list of keywords from preprocessed MICCAI 2023 research papers. The extracted sentences are grouped by paper title and target text likely to report demographics (age, gender, ethnicity) and related categories such as location, dataset, and bias information.

### Input Data Expected
- **CSV File**: The input for this notebook is a CSV file containing preprocessed text from research papers. The code reads two columns, `title` (paper title) and `text` (text content), which were produced in a previous step (as indicated by a linked notebook).
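
As a minimal sketch of the expected layout, the snippet below builds a small sample with the `title` and `text` columns that the extraction code reads. The sample rows are invented for illustration. It assumes a paper may span several rows, which is consistent with the sentence extractor joining all rows of a title before splitting.

```python
import io
import pandas as pd

# Invented sample rows; the real file comes from the preprocessing notebook.
sample_csv = io.StringIO(
    "title,text\n"
    '"Paper A","The cohort included 120 patients."\n'
    '"Paper A","Mean age was 54 years."\n'
    '"Paper B","We train on a public dataset."\n'
)
df_sample = pd.read_csv(sample_csv)

# Rows per paper: one title can own several text rows.
rows_per_title = df_sample.groupby("title").size()
print(df_sample.columns.tolist())
print(rows_per_title.to_dict())
```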

### Output Data/Files Generated
- **CSV Files**: For each category of interest (age, gender, ethnicity, location, dataset, and bias information), the notebook writes two CSV files: `<category>_related_keywords.csv` with the matched keywords and `<category>_related_sentences.csv` with the extracted sentences. These files support further analysis or machine learning tasks focused on these categories.
- **Directories**: The extracted data will be saved into directories specified for cancer-related content, patient-related content, etc., facilitating organized access to this information for further use.

### Assumptions or Important Notes
- **Preprocessing Required**: It assumes that the data has been preprocessed for extraction, meaning any necessary cleaning, formatting, or preliminary analysis has been completed beforehand.
- **Keyword Relevance**: The effectiveness of the extraction process heavily depends on the relevance and comprehensiveness of the keyword list used for extraction. Misclassification or omission of relevant keywords might lead to incomplete or skewed data analysis.
- **Text Structure**: Sentences are split with a simple regular expression at `.`, `?`, or `!` followed by whitespace, so the notebook assumes clean, conventionally punctuated text. Abbreviations, figure references, and other irregularities can split or merge sentences and reduce the accuracy of sentence extraction.
- **Error Handling**: The notebook does little error handling. Output directories are created on demand with `os.makedirs`, but a missing or unreadable input file stops execution, so it is crucial to ensure that the input paths are correct and accessible.
- **Scalability**: Depending on the volume of data (number of papers and the length of text in them), the process might be computationally intensive, requiring optimization for handling large datasets efficiently.
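
To make the keyword-relevance note concrete, here is a small sketch of how word-boundary matching behaves. It uses an illustrative three-keyword subset and invented sentences; the pattern is built the same way as in `extract_keywords`. Word boundaries prevent hits inside longer words such as "image" or "female", but they also miss inflected forms such as "aged" and still match keywords used in an unrelated sense.

```python
import re

keywords = ["age", "male", "state"]  # small illustrative subset
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    flags=re.IGNORECASE,
)

sentences = [
    "Median age was 61.",                     # intended hit
    "Each image was resized.",                # 'age' inside 'image'
    "Of these, 40 were female.",              # 'male' inside 'female'
    "Patients aged 40 to 60 were included.",  # inflected form 'aged'
    "We outperform the state-of-the-art.",    # 'state' in an unrelated sense
]

hits = {s: bool(pattern.search(s)) for s in sentences}
for sentence, matched in hits.items():
    print(matched, sentence)
```

Reviewing a sample of hits and misses like this for each category is one way to decide whether a keyword should be added, removed, or made more specific.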

# Keyword-Based Sentence Extraction from Selected MICCAI 2023 Research Articles
***

1. Setup and Imports: Import necessary libraries (e.g., os, pandas).

In [1]:
import os
import re
import pandas as pd
import numpy as np
from collections import Counter

2. File and Data Loading: Load data from CSV files.
***

In [2]:
def load_data(filename):
    """
    Load the dataset from a CSV file.
    """
    return pd.read_csv(filename)
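
A possible hardening of the loader is sketched below. `load_data_checked` is a hypothetical helper, not part of the notebook; it assumes the input must provide the `title` and `text` columns used by the extraction functions and fails early with a clear message otherwise.

```python
import os
import tempfile
import pandas as pd


def load_data_checked(filename, required_columns=("title", "text")):
    """Hypothetical variant of load_data that validates the file and its columns."""
    if not os.path.isfile(filename):
        raise FileNotFoundError(f"Input CSV not found: {filename}")
    df = pd.read_csv(filename)
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"{filename} is missing required columns: {missing}")
    return df


# Usage with a throwaway file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "papers.csv")
    pd.DataFrame({"title": ["Paper A"], "text": ["Mean age was 54."]}).to_csv(path, index=False)
    df_loaded = load_data_checked(path)

print(df_loaded.shape)
```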

3. A utility function (wrap_text) to wrap sentences at a given width, making the dataframe more readable.
***

In [3]:
def wrap_text(text, width=80):
    """
    A simple function to wrap text at a given width.
    """
    if pd.isnull(text):
        return text
    
    wrapped_lines = []
    for paragraph in text.split('\n'):
        line = ''
        for word in paragraph.split():
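            # Only break when the current line has content; otherwise a word
            # longer than `width` would emit a spurious empty line first
            if line and len(line) + len(word) + 1 > width: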
                wrapped_lines.append(line)
                line = word
            else:
                line += (' ' + word if line else word)
        wrapped_lines.append(line)
    return '\n'.join(wrapped_lines)
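
For comparison, a similar result can be obtained from the standard library. The sketch below is an alternative, not the notebook's implementation: `wrap_text_stdlib` is a hypothetical name, and `textwrap.fill` differs in details (by default it may break words longer than the width and may break at hyphens).

```python
import textwrap


def wrap_text_stdlib(text, width=80):
    """Hypothetical alternative to wrap_text built on textwrap.fill."""
    if not isinstance(text, str):
        return text  # pass missing values (None, NaN) through unchanged
    # Wrap each paragraph separately so existing line breaks are preserved
    return "\n".join(
        textwrap.fill(paragraph, width=width) for paragraph in text.split("\n")
    )


sample = "The cohort included 120 patients with a mean age of 54 years.\nSecond paragraph here."
wrapped = wrap_text_stdlib(sample, width=20)
print(wrapped)
```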

4. Keyword Extraction: Logic to extract keywords and sentences containing those keywords.
***

In [4]:
def extract_keywords(df, keywords):
    """
    Find every keyword occurrence in each row's text.

    Returns a DataFrame with one row per match and the columns
    'title' and 'keyword' (the matched text, in its original casing).
    """
    # Escape keywords so regex metacharacters are matched literally
    pattern = re.compile(r'\b(' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b',
                         flags=re.IGNORECASE)
    matches_by_paper = {}

    for _, row in df.iterrows():
        text = row['text']
        if not isinstance(text, str):
            continue  # skip rows with missing (NaN) text
        matches = pattern.findall(text)
        if matches:
            matches_by_paper.setdefault(row['title'], []).extend(matches)

    keywords_data = [(title, match) for title, matches in matches_by_paper.items() for match in matches]
    return pd.DataFrame(keywords_data, columns=['title', 'keyword'])


def extract_keyword_sentences(df, keywords):
    """
    Extract sentences containing specified keywords from DataFrame and organize by paper title.
    """
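    # Escape keywords so regex metacharacters are matched literally
    keyword_pattern = re.compile(r'\b(?:' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b', flags=re.IGNORECASE)
    sentences_by_paper = {}

    for title in df['title'].unique():
        # Join all rows of the same paper; drop missing text so the join does not fail
        text = ' '.join(df.loc[df['title'] == title, 'text'].dropna().astype(str))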
        sentences = re.split(r'(?<=[.?!])\s+', text)
        keyword_sentences_buffer = []

        for sentence in sentences:
            if keyword_pattern.search(sentence):
                keyword_sentences_buffer.append(sentence)

        sentences_by_paper[title] = keyword_sentences_buffer if keyword_sentences_buffer else ['none']
    
    extracted_data = [(title, keyword_sentence) for title, related_group in sentences_by_paper.items() for keyword_sentence in related_group]
    extracted_df = pd.DataFrame(extracted_data, columns=['title', 'extracted_keyword_sent'])
    extracted_df['extracted_keyword_sent'] = extracted_df['extracted_keyword_sent'].apply(lambda x: wrap_text(x, width=80))

    return extracted_df
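
The sentence splitter used above is a plain regular expression, so it is worth seeing where it breaks. The sketch below applies the same `re.split` call to an invented passage; abbreviations such as "Fig." and "et al." end in a period followed by whitespace and are therefore treated as sentence ends.

```python
import re

text = (
    "Data were collected at three hospitals (Fig. 2). "
    "Smith et al. reported a mean age of 54. "
    "Bias was not assessed."
)

# Same split rule as in extract_keyword_sentences
sentences = re.split(r"(?<=[.?!])\s+", text)
for i, sentence in enumerate(sentences):
    print(i, repr(sentence))
```

Here the keyword "age" ends up in the fragment "reported a mean age of 54." without its subject, which illustrates how extracted sentences can lose context on text with many abbreviations.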

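5. Category-level helpers: functions that run the extraction for each keyword category and save the results (extract_keywords_by_category, save_extracted_data_by_category, extract_and_save_sentences_by_category).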
***

In [6]:
def extract_keywords_by_category(df, categories):
    """
    Extract keywords from DataFrame based on multiple categories of keywords.
    
    Parameters:
    - df: DataFrame containing the text to search through.
    - categories: A dictionary with category names as keys and lists of keywords as values.

    Returns:
    - A dictionary with category names as keys and DataFrames of extracted keywords as values.
    """
    extracted_data = {}
    for category, keywords in categories.items():
        extracted_data[category] = extract_keywords(df, keywords)
    return extracted_data
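
Each per-category DataFrame holds one row per match, with the matched text in its original casing, so "Age" and "age" appear as different values. A minimal sketch of turning such a frame into keyword counts per paper follows; the `matches` frame is a hand-made stand-in for the output of `extract_keywords`, and lowercasing before counting is an assumption about the desired grouping.

```python
import pandas as pd

# Hand-made stand-in for a DataFrame returned by extract_keywords
matches = pd.DataFrame({
    "title": ["Paper A", "Paper A", "Paper A", "Paper B"],
    "keyword": ["Age", "age", "sex", "male"],
})

# Normalize casing, then count occurrences per (title, keyword) pair
counts = (
    matches.assign(keyword=matches["keyword"].str.lower())
    .groupby(["title", "keyword"])
    .size()
    .reset_index(name="count")
)
print(counts)
```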

In [7]:
def save_extracted_data_by_category(data_by_category, output_dir):
    """
    Save each DataFrame in the data_by_category dictionary to a CSV file.
    Each category is written to its own file, named <category>_related_keywords.csv.
    
    Parameters:
    - data_by_category (dict): A dictionary with category names as keys and DataFrames as values.
    - output_dir (str): The directory where the CSV files will be saved.
    
    Returns:
    - None
    """
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Iterate through the dictionary
    for category, df in data_by_category.items():
        # Define the output file path
        output_file_path = os.path.join(output_dir, f"{category}_related_keywords.csv")
        # Save the DataFrame to a CSV file
        df.to_csv(output_file_path, index=False)
        print(f"Saved {category} data to {output_file_path}")


# Cancer related papers
# keyword counts

#output_directory = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/outputs/extracted_data_cancer'
#save_extracted_data_by_category(extracted_data_by_category, output_directory)

# Cancer patient related papers
# keyword counts
output_directory = '.../extracted_data_patients'
#save_extracted_data_by_category(extracted_data_by_category, output_directory)

In [8]:
def extract_and_save_sentences_by_category(df, categories, output_dir):
    """
    Extract sentences by keywords for each category and save them to CSV files.
    
    Parameters:
    - df: DataFrame containing the text to search through.
    - categories: A dictionary with category names as keys and lists of keywords as values.
    - output_dir (str): The directory where the CSV files will be saved.
    
    Returns:
    - None
    """
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Iterate through the categories and keywords
    for category, keywords in categories.items():
        # Extract sentences containing the keywords
        extracted_df = extract_keyword_sentences(df, keywords)
        
        # Define the output file path
        output_file_path = os.path.join(output_dir, f"{category}_related_sentences.csv")
        
        # Save the DataFrame to a CSV file
        extracted_df.to_csv(output_file_path, index=False)
        print(f"Saved {category} sentences to {output_file_path}")
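
Because the extracted sentences are wrapped with `wrap_text` before saving, a single CSV record can span several physical lines. The sketch below, using an invented row and a temporary directory, shows that pandas quotes such fields and reads them back as one value, and how the wrapping could be undone after loading.

```python
import os
import tempfile
import pandas as pd

wrapped = "The cohort included 120 patients\nwith a mean age of 54 years."
df_out = pd.DataFrame({"title": ["Paper A"], "extracted_keyword_sent": [wrapped]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "age_related_sentences.csv")
    df_out.to_csv(path, index=False)
    df_back = pd.read_csv(path)
    with open(path, encoding="utf-8") as fh:
        physical_lines = fh.read().splitlines()

# One record, but the file has a header plus two physical lines for it
print(len(df_back), len(physical_lines))

# Undo the wrapping if single-line sentences are needed downstream
unwrapped = df_back["extracted_keyword_sent"].str.replace("\n", " ", regex=False)
print(unwrapped.iloc[0])
```

Tools that read the files line by line rather than with a CSV parser would see broken records, so it may be preferable to wrap only for display and store the unwrapped text.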

In [9]:
# Categories and their corresponding keywords
categories = {
    'age': ['age', 'young', 'old', 'gender'],
    'gender': ['gender', 'sex', 'women', 'woman', 'female', 'male'],
    'ethnicity': ['ethnicity', 'ethnicities', 'race', 'white patients', 'black patients'],
    'location_info': ['geolocation', 'geographical', 'geographic', 'country', 'countries', 
                    'city', 'cities', 'hospital', 'hospitals', 'clinic', 'clinics', 'continent',
                    'province', 'state', 'region', 'town', 'village', 'area', 'district'],
    'dataset_info': ['dataset', 'datasets', 'data set', 'data sets', 'publicly', 'public', 'private', 'open access', 'open-access'],
    'bias_info': ['bias', 'biases', 'fairness']
}
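
Hand-maintained keyword lists can pick up repeated entries or keywords shared between categories. The following sketch shows one way to check for both; `find_keyword_issues` is a hypothetical helper and `example_categories` is a small illustrative dictionary, not the notebook's full list.

```python
from collections import Counter

# Small illustrative dictionary with a repeated and a shared keyword
example_categories = {
    "age": ["age", "age", "young", "gender"],
    "gender": ["gender", "sex"],
}


def find_keyword_issues(categories):
    """Return (duplicates within a category, keywords shared across categories)."""
    duplicates = {}
    owners = {}
    for category, kws in categories.items():
        repeated = sorted(kw for kw, n in Counter(kws).items() if n > 1)
        if repeated:
            duplicates[category] = repeated
        for kw in set(kws):
            owners.setdefault(kw, []).append(category)
    shared = {kw: sorted(cats) for kw, cats in owners.items() if len(cats) > 1}
    return duplicates, shared


duplicates, shared = find_keyword_issues(example_categories)
print(duplicates)
print(shared)
```

A shared keyword is not necessarily a mistake, but it means the same sentence will be exported under more than one category.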

### Papers with cancer-related content only
***

In [None]:
# Dataset with cancer-related papers and extract text
filename = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/outputs/databases/cancer_related_papers_w_text.csv'
df_cancer_related = pd.read_csv(filename)
unique_titles_count = len(df_cancer_related['title'].unique())
print(f"Number of unique titles: {unique_titles_count}")

In [None]:
# Cancer-related papers
output_directory = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/outputs/extracted_sentences_cancer'
extract_and_save_sentences_by_category(df_cancer_related, categories, output_directory)

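# Cancer-related papers: keyword matches per category
# Build the per-category DataFrames before saving them
output_directory = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/outputs/extracted_data_cancer'
extracted_data_by_category = extract_keywords_by_category(df_cancer_related, categories)
save_extracted_data_by_category(extracted_data_by_category, output_directory)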

### Papers with cancer AND patient content only
***

In [19]:
# Dataset with cancer-patient-related papers and extract text
filename = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/outputs/databases/papers_with_patients.csv'
df_cancer_patient_related = pd.read_csv(filename)
unique_titles_count = len(df_cancer_patient_related['title'].unique())
print(f"Number of unique titles: {unique_titles_count}")

Number of unique titles: 155


In [24]:
# Output directory where the extracted data will be saved
# Cancer patient related papers

# keyword counts: build the per-category DataFrames first, because
# save_extracted_data_by_category calls .to_csv on every dictionary value
output_directory = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/outputs/extracted_data_patients'
extracted_data_by_category = extract_keywords_by_category(df_cancer_patient_related, categories)
save_extracted_data_by_category(extracted_data_by_category, output_directory)

# keyword-sentence extractions
output_directory = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/outputs/extracted_sentences_patients'
extract_and_save_sentences_by_category(df_cancer_patient_related, categories, output_directory)

In [1]:
# Warn if fewer than 100 unique titles are available
unique_titles = df_cancer_patient_related['title'].nunique()
if unique_titles < 100:
    print(f"Warning: Only {unique_titles} unique papers found, fewer than 100.")

# Randomly select up to 100 unique titles
selected_titles = df_cancer_patient_related['title'].drop_duplicates().sample(n=min(100, unique_titles), random_state=32)

# Filter the original DataFrame to include only the selected titles
selected_papers_df = df_cancer_patient_related[df_cancer_patient_related['title'].isin(selected_titles)]

# selected_papers_df now holds the randomly selected papers and all of their rows
#selected_papers_df.to_csv('/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/outputs/databases/50_selected_papers_2.csv')
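
The sampling step can be tried on a toy frame before running it on the real data. The sketch below uses invented titles and a hypothetical `n_wanted`; it samples at the level of unique titles and then keeps every row that belongs to a selected title, which is the same pattern as in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "B", "C", "C", "D"],
    "text": ["a1", "a2", "b1", "c1", "c2", "d1"],
})

n_wanted = 3  # hypothetical sample size for this toy example
n_unique = toy["title"].nunique()

# Sample titles, not rows, so papers with many rows are not over-represented
selected_titles = toy["title"].drop_duplicates().sample(
    n=min(n_wanted, n_unique), random_state=32
)
selected = toy[toy["title"].isin(selected_titles)]
print(sorted(selected["title"].unique()))
```

Fixing `random_state` keeps the selection reproducible across runs, provided the input rows and their order do not change.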