# Sentence extraction for analysis purposes 
***


This notebook extracts sentences from MICCAI 2023 research papers that were already preprocessed and extracted by the previous notebook (https://github.com/yasminsarkhosh/machine-learning-bsc-thesis-2024/blob/main/code/data_processing_w_GROBID.ipynb). Sentences are selected with lists of relevant keywords and organized by paper title.


**It includes:**
1. Setup and Imports: Import necessary libraries (e.g., os, pandas).
2. File and Data Loading: Load data from CSV files.
    - The dataframe holds the selected papers with cancer-relevant content, with text extracted from Abstract through Conclusion.
3. A utility function (wrap_text) to wrap sentences at a given width, making the dataframe more readable.
4. Keyword Extraction: Logic to extract keywords and sentences containing those keywords.
    - For demographics, a keyword list was created and tested iteratively. The goal is a list that captures the necessary sentences without excluding valuable information, and without pulling in so much irrelevant text that the analysis becomes overwhelming.
5. Functions for keyword and sentence extraction (extract_keywords, extract_keyword_sentences).
6. Dataframe Manipulations: Operations on dataframes like filling NaN values, type casting, and saving to CSV.

1. Setup and Imports: Import necessary libraries (e.g., os, pandas).
***

In [3]:
import os
import re
import pandas as pd
import numpy as np
from collections import Counter


2. File and Data Loading: Load data from CSV files.
***

In [4]:
def load_data(filename):
    """
    Load the dataset from a CSV file.
    """
    return pd.read_csv(filename)

In [5]:
# Dataset with cancer-related papers and their extracted text
filename = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/outputs/databases/cancer_related_papers_w_text.csv'
df_cancer_related = load_data(filename)
unique_titles_count = len(df_cancer_related['title'].unique())
print(f"Number of unique titles: {unique_titles_count}")

Number of unique titles: 263


3. A utility function (wrap_text) to wrap sentences at a given width, making the dataframe more readable.
***

In [6]:
def wrap_text(text, width=100):
    """
    A simple function to wrap text at a given width.
    """
    if pd.isnull(text):
        return text
    
    wrapped_lines = []
    for paragraph in text.split('\n'):
        line = ''
        for word in paragraph.split():
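            # Only break when the current line already has content; otherwise a
            # word longer than `width` would produce a spurious empty line
            if line and len(line) + len(word) + 1 > width: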
                wrapped_lines.append(line)
                line = word
            else:
                line += (' ' + word if line else word)
        wrapped_lines.append(line)
    return '\n'.join(wrapped_lines)
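
For comparison, Python's standard `textwrap` module offers similar greedy wrapping. The sketch below is a hypothetical alternative (`wrap_text_stdlib` is not part of the notebook). It wraps each paragraph separately and leaves long words intact, which is assumed to be close to what `wrap_text` does. It checks for `None` rather than using `pd.isnull` so that it runs without pandas.

```python
import textwrap


def wrap_text_stdlib(text, width=100):
    """Hypothetical stdlib-based counterpart to wrap_text (illustration only)."""
    if text is None:
        return text
    # Wrap each paragraph on its own so existing line breaks are preserved
    return '\n'.join(
        textwrap.fill(
            paragraph,
            width=width,
            break_long_words=False,   # never cut a word in the middle
            break_on_hyphens=False,   # treat hyphenated terms as one word
        )
        for paragraph in text.split('\n')
    )


# Made-up sample sentence
sample = "Patients were recruited from three hospitals between 2015 and 2019."
wrapped = wrap_text_stdlib(sample, width=30)
print(wrapped)
```

Every resulting line is at most `width` characters long, unless a single word exceeds that limit.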

4. Keyword Extraction: Logic to extract keywords and sentences containing those keywords.
***

In [11]:
def extract_keywords(df, keywords):
    """
    Find every whole-word keyword occurrence in each row's text.

    Returns a DataFrame with one row per match: the paper title and the
    matched keyword (in the casing found in the text).
    """
    # Whole-word match on any keyword; re.escape treats keywords as literal text
    pattern = r'\b(' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b'
    matches_by_paper = {}

    for index, row in df.iterrows():
        matches = re.findall(pattern, row['text'], flags=re.IGNORECASE)
        if matches:
            # Collect all matches for the same title in one list
            matches_by_paper.setdefault(row['title'], []).extend(matches)

    keywords_data = [(title, keyword) for title, matches in matches_by_paper.items() for keyword in matches]
    return pd.DataFrame(keywords_data, columns=['title', 'keyword'])


def extract_keyword_sentences(df, keywords):
    """
    Extract sentences containing specified keywords from DataFrame and organize by paper title.
    """
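    # Escape keywords so regex metacharacters are matched literally (as in extract_keywords)
    keyword_pattern = re.compile(r'\b(?:' + '|'.join(re.escape(keyword) for keyword in keywords) + r')\b', flags=re.IGNORECASE)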
    sentences_by_paper = {}

    for title in df['title'].unique():
        # Drop missing text (NaN) so the join does not fail on non-string values
        text = ' '.join(df.loc[df['title'] == title, 'text'].dropna().astype(str))
        sentences = re.split(r'(?<=[.?!])\s+', text)
        keyword_sentences_buffer = []

        for sentence in sentences:
            if keyword_pattern.search(sentence):
                keyword_sentences_buffer.append(sentence)

        sentences_by_paper[title] = keyword_sentences_buffer if keyword_sentences_buffer else ['none']
    
    extracted_data = [(title, keyword_sentence) for title, related_group in sentences_by_paper.items() for keyword_sentence in related_group]
    extracted_df = pd.DataFrame(extracted_data, columns=['title', 'extracted_keyword_sent'])
    extracted_df['extracted_keyword_sent'] = extracted_df['extracted_keyword_sent'].apply(lambda x: wrap_text(x, width=80))

    return extracted_df
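
To make the matching behavior concrete, the sketch below applies the same sentence-splitting regex and a word-boundary keyword pattern to a made-up text. The keyword list and sentences are illustrative assumptions, not data from the papers.

```python
import re

# Toy keyword list and text, made up for illustration only
keywords = ['age', 'male']
keyword_pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
    flags=re.IGNORECASE,
)

text = (
    "Each image was resized to 256 pixels. The median age was 61 years! "
    "Only female patients were included. Was the cohort 40% male?"
)

# Split after '.', '?' or '!' followed by whitespace (same regex as above)
sentences = re.split(r'(?<=[.?!])\s+', text)

# Keep only sentences containing a keyword as a whole word:
# 'image' does not count as 'age', and 'female' does not count as 'male'
matched = [s for s in sentences if keyword_pattern.search(s)]
for sentence in matched:
    print(sentence)

# Limitation: abbreviations also end in a period, so they trigger a split
fragments = re.split(r'(?<=[.?!])\s+', "See Fig. 2 for details.")
print(fragments)
```

The `\b` boundaries keep substrings such as "image" or "female" from producing false hits. The simple splitter, however, breaks sentences at abbreviations such as "Fig.", so some extracted sentences may be fragments.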

5. Functions for keyword and sentence extraction (extract_keywords, extract_keyword_sentences).
***

In [12]:
def extract_keywords_by_category(df, categories):
    """
    Extract keywords from DataFrame based on multiple categories of keywords.
    
    Parameters:
    - df: DataFrame containing the text to search through.
    - categories: A dictionary with category names as keys and lists of keywords as values.

    Returns:
    - A dictionary with category names as keys and DataFrames of extracted keywords as values.
    """
    extracted_data = {}
    for category, keywords in categories.items():
        extracted_data[category] = extract_keywords(df, keywords)
    return extracted_data

# Categories and their corresponding keywords
categories = {
    'age': ['age'],
    'gender': ['gender', 'sex', 'women', 'woman', 'female', 'male'],
    'ethnicity': ['ethnicity', 'ethnicities', 'race', 'white patients', 'black patients'],
    'geolocation': ['geolocation', 'geographical', 'geographic', 'country', 'countries', 'city', 'cities', 
                    'hospital', 'hospitals', 'clinic', 'clinics'],
    'patients': ['patient', 'patients'],
    'bias': ['bias', 'biases'],
}

# Call the function to extract keywords by category
extracted_data_by_category = extract_keywords_by_category(df_cancer_related, categories)

# Now, extracted_data_by_category will hold a dictionary where each key is a category
# and the value is the DataFrame containing the extracted keywords for that category.

# To process or save the extracted data for each category:
for category, data in extracted_data_by_category.items():
    # Process or save data
    print(f"Category: {category}, Rows: {len(data)}")
    # For example, to save:
    #data.to_csv(f"{category}_related_keywords.csv", index=False)


Category: age, Rows: 28
Category: gender, Rows: 36
Category: ethnicity, Rows: 29
Category: geolocation, Rows: 98
Category: patients, Rows: 950
Category: bias, Rows: 154
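
The row counts above aggregate all keywords within a category. To see which individual keyword contributes to a count, matches can be lowercased and tallied per paper. This is a minimal stdlib-only sketch on made-up records. In the notebook, a similar tally could presumably be computed from the `keyword` column of each extracted DataFrame.

```python
import re
from collections import Counter

# Made-up records standing in for rows of the dataframe
records = [
    {'title': 'Paper A', 'text': 'Male and female patients. Age was recorded; sex was not.'},
    {'title': 'Paper B', 'text': 'The image age is unknown. Female sex only.'},
]
keywords = ['gender', 'sex', 'women', 'woman', 'female', 'male']
pattern = re.compile(
    r'\b(' + '|'.join(re.escape(k) for k in keywords) + r')\b',
    flags=re.IGNORECASE,
)

# Count (title, keyword) pairs; lowercase so 'Female' and 'female' are merged
counts = Counter()
for record in records:
    for match in pattern.findall(record['text']):
        counts[(record['title'], match.lower())] += 1

for (title, keyword), n in sorted(counts.items()):
    print(f"{title}: {keyword} x{n}")
```

Lowercasing matters because a case-insensitive `re.findall` returns each match in the casing found in the text.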


In [13]:
def save_extracted_data_by_category(data_by_category, output_dir):
    """
    Save each DataFrame in the data_by_category dictionary to a CSV file.
    Each category has a list of corresponding keywords
    
    Parameters:
    - data_by_category (dict): A dictionary with category names as keys and DataFrames as values.
    - output_dir (str): The directory where the CSV files will be saved.
    
    Returns:
    - None
    """
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Iterate through the dictionary
    for category, df in data_by_category.items():
        # Define the output file path
        output_file_path = os.path.join(output_dir, f"{category}_related_keywords.csv")
        # Save the DataFrame to a CSV file
        df.to_csv(output_file_path, index=False)
        print(f"Saved {category} data to {output_file_path}")

# Example usage:
output_directory = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/extracted_data'
save_extracted_data_by_category(extracted_data_by_category, output_directory)


Saved age data to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/extracted_data/age_related_keywords.csv
Saved gender data to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/extracted_data/gender_related_keywords.csv
Saved ethnicity data to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/extracted_data/ethnicity_related_keywords.csv
Saved geolocation data to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/extracted_data/geolocation_related_keywords.csv
Saved patients data to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/extracted_data/patients_related_keywords.csv
Saved bias data to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/extracted_data/bias_related_keywords.csv


In [14]:
def extract_and_save_sentences_by_category(df, categories, output_dir):
    """
    Extract sentences by keywords for each category and save them to CSV files.
    
    Parameters:
    - df: DataFrame containing the text to search through.
    - categories: A dictionary with category names as keys and lists of keywords as values.
    - output_dir (str): The directory where the CSV files will be saved.
    
    Returns:
    - None
    """
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Iterate through the categories and keywords
    for category, keywords in categories.items():
        # Extract sentences containing the keywords
        extracted_df = extract_keyword_sentences(df, keywords)
        
        # Define the output file path
        output_file_path = os.path.join(output_dir, f"{category}_related_sentences.csv")
        
        # Save the DataFrame to a CSV file
        extracted_df.to_csv(output_file_path, index=False)
        print(f"Saved {category} sentences to {output_file_path}")

# Example usage (reuses the `categories` dictionary defined in the previous cell):

output_directory = '/Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/extracted_sentences'
extract_and_save_sentences_by_category(df_cancer_related, categories, output_directory)


Saved age sentences to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/sentences/age_related_sentences.csv
Saved gender sentences to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/sentences/gender_related_sentences.csv
Saved ethnicity sentences to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/sentences/ethnicity_related_sentences.csv
Saved geolocation sentences to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/sentences/geolocation_related_sentences.csv
Saved patients sentences to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/sentences/patients_related_sentences.csv
Saved bias sentences to /Users/yasminsarkhosh/Documents/GitHub/machine-learning-bsc-thesis-2024/code/sentences/bias_related_sentences.csv
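
The saved sentence files contain line breaks inserted by `wrap_text`, and papers without any match are stored with the placeholder `'none'`. A later analysis step may want to undo both. The sketch below shows one way to do that with the standard `csv` module on an in-memory sample. The sample rows are hypothetical and only mimic the column layout written above.

```python
import csv
import io

# Toy stand-in for a *_related_sentences.csv file (contents are hypothetical)
csv_text = (
    'title,extracted_keyword_sent\n'
    'Paper A,"The median age of the patients\nwas 61 years."\n'
    'Paper B,none\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Drop placeholder rows and collapse the wrapping back into single-line sentences
cleaned = [
    {'title': row['title'], 'sentence': ' '.join(row['extracted_keyword_sent'].split())}
    for row in rows
    if row['extracted_keyword_sent'] != 'none'
]

for item in cleaned:
    print(item['title'], '->', item['sentence'])
```

Filtering on the literal string `'none'` assumes no real extracted sentence consists of exactly that word.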
