# Brain Tumor Analysis: Data Preprocessing

This Jupyter notebook is part of a project focused on analyzing brain tumor data using machine learning techniques. The dataset we are working with contains histopathological images and corresponding clinical annotations of 3,115 brain tumor patients, covering a wide variety of tumor types. This rich dataset was digitized from a large dedicated brain tumor bank and made publicly available for research purposes. 

Our task here is twofold:

1. **Data Loading**: We begin by loading the `annotation.csv` file which contains clinical annotations for each patient. This includes patient demographics, tumor characteristics (type, grade, subtype), and other relevant clinical details. Note that `label` in the code corresponds to a patient's tumor type.

2. **Data Downloading and Preprocessing**: The histopathological images corresponding to each patient are hosted on the EBRAINS data proxy API. This notebook contains a script to download these `.ndpi` files. Because the images have irregular sizes, each slide is first converted to a PNG taken from the smallest magnification level that is at least 2048 pixels in both dimensions. The PNGs are then resized to fixed sizes (256x256, 512x512, 1024x1024, 2048x2048 pixels) for consistency and ease of further processing.

The notebook also includes a utility to check which files are still missing, in case the download process is interrupted and you end up with an incomplete dataset. This allows for a more controlled download process and helps ensure data integrity.

This data preprocessing step is crucial to prepare our dataset for subsequent exploratory data analysis, feature extraction, and machine learning modeling.

Feel free to explore, modify, and run the cells as needed. If you have any questions or encounter any issues, please reach out to the team.


## Imports and annotations data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob
import requests
import json
from tqdm.notebook import tqdm
import shutil

with os.add_dll_directory('C://openslide-win64/bin'):
	import openslide

# Read the CSV file containing patient information
data_file = 'annotation.csv'
df = pd.read_csv(data_file)
# replace "/" with "-" in diagnosis
df['diagnosis'] = df['diagnosis'].str.replace('/', '-')

# Count the number of occurrences for each diagnosis
diagnosis_counts = df['diagnosis'].value_counts()

# Get unique diagnosis labels, replace missing values with 'nan'
diagnosis_labels = df['diagnosis'].fillna('nan').unique()

## Data Preprocessing and Image Processing for NDPI Files 

The following cell implements the image processing step. It defines functions for loading an NDPI file, finding the smallest magnification level that meets the minimum size requirement, downsampling the image if a higher-resolution level had to be read instead, and saving the processed image as a PNG file and its metadata as a JSON file.

In [2]:
def get_image_with_min_size(file_path, min_size=2048):
	# Open the NDPI file using OpenSlide
	ndpi_file = openslide.OpenSlide(file_path)
	ndpi_metadata = dict(ndpi_file.properties)

	# Index of the last (lowest-resolution) magnification level; the search starts there
	mag_level_count = int(ndpi_metadata['openslide.level-count']) - 1
	target_mag_level = mag_level_count
	while target_mag_level >= 0:
		# Get the width and height of the requested magnification level
		target_ndpi_width = int(ndpi_metadata[f'openslide.level[{target_mag_level}].width'])
		target_ndpi_height = int(ndpi_metadata[f'openslide.level[{target_mag_level}].height'])
		if target_ndpi_width >= min_size and target_ndpi_height >= min_size:
			break
		target_mag_level -= 1
	if target_mag_level < 0:
		print('Error: NDPI file is too small')
		return None, ndpi_metadata

	target_ndpi_width = int(ndpi_metadata[f'openslide.level[{target_mag_level}].width'])
	target_ndpi_height = int(ndpi_metadata[f'openslide.level[{target_mag_level}].height'])

	mag_level = target_mag_level

	# Iterate through magnification levels to find the one with a valid image
	while mag_level >= 0:
		# Reopen the file after trying to read_region, otherwise you get an error
		if mag_level < target_mag_level:
			ndpi_file = openslide.OpenSlide(file_path)

		# Get the width and height of the requested magnification level
		ndpi_width = int(ndpi_metadata[f'openslide.level[{mag_level}].width'])
		ndpi_height = int(ndpi_metadata[f'openslide.level[{mag_level}].height'])
		try:
			# Load the image at the requested magnification level
			ndpi_image = ndpi_file.read_region((0, 0), mag_level, (ndpi_width, ndpi_height))

			# Convert the image to RGB format
			ndpi_image = ndpi_image.convert('RGB')

			if mag_level < target_mag_level:
				# Downsample the image to the target magnification level
				ndpi_image = ndpi_image.resize((target_ndpi_width, target_ndpi_height))

			# Close the NDPI file and return the image
			ndpi_file.close()
			return ndpi_image, ndpi_metadata
		except Exception:
			print('Trying again with mag', mag_level - 1)
			mag_level -= 1
			ndpi_file.close()
	
	# No level could be read: raise instead of implicitly returning None, which callers cannot unpack
	raise RuntimeError(f'Could not load image from {file_path} at any magnification level')

def save_image(processed_path, ndpi_file_name):
	# Construct the file paths
	file_path = os.path.join(processed_path, ndpi_file_name)
	output_image_path = file_path[:-5] + '.png'

	# Get the image and metadata using the get_image_with_min_size function
	ndpi_image, metadata = get_image_with_min_size(file_path)

	# Save the image
	try:
		ndpi_image.save(output_image_path)
	except Exception:
		# Also reached when ndpi_image is None because the slide was too small
		print(f"Didn't save {output_image_path}")

	# Save the metadata as a JSON file
	metadata_path = os.path.join(processed_path, 'metadata')
	if not os.path.exists(metadata_path):
		os.mkdir(metadata_path)
	# Construct the metadata file path
	metadata_file_path = os.path.join(metadata_path, ndpi_file_name[:-5] + '.json')

	# Write the metadata to the JSON file
	with open(metadata_file_path, 'w') as metadata_file:
		json.dump(metadata, metadata_file)
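
The level search in `get_image_with_min_size` can be isolated as a small pure function, which makes it easy to check without opening a slide. The sketch below is a minimal illustration, not the notebook's implementation: `pick_level` and the example pyramid are hypothetical, and the list of `(width, height)` tuples is assumed to be ordered like OpenSlide's `level_dimensions` (index 0 is the full-resolution level).

```python
from typing import List, Optional, Tuple


def pick_level(level_dims: List[Tuple[int, int]], min_size: int = 2048) -> Optional[int]:
    """Return the index of the smallest level that is at least min_size in both dimensions."""
    # Walk from the smallest level (last index) towards the full-resolution level (index 0)
    for level in range(len(level_dims) - 1, -1, -1):
        width, height = level_dims[level]
        if width >= min_size and height >= min_size:
            return level
    # Even the full-resolution level is too small
    return None


# Hypothetical pyramid, for illustration only
dims = [(16384, 12288), (8192, 6144), (4096, 3072), (2048, 1536), (1024, 768)]
print(pick_level(dims))           # 2: level 3 is only 1536 pixels high
print(pick_level(dims, 512))      # 4
print(pick_level([(1000, 800)]))  # None
```

Returning `None` for a slide that is too small mirrors the `None, ndpi_metadata` return of the function above.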


## Missing File Check

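For each diagnosis label, the code counts the PNG files in the corresponding folder of processed data and compares that count with the number of patients annotated with that label in `annotation.csv`. Labels whose counts differ still have missing files. Folders that exist but contain no PNG files are removed along the way.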

In [19]:
def compare(compare_path):
    missing = []

    # Iterate through the diagnosis labels
    for label in diagnosis_labels:
        folder_path = os.path.join(compare_path, label.replace("/", "-"))

        # Count the number of files with PNG extension in the folder
        num_files = len(glob.glob(os.path.join(folder_path, '*.png')))

        # Remove the folder if it exists but has no files
        if os.path.exists(folder_path) and num_files == 0:
            shutil.rmtree(folder_path)

        # Check if the folder exists and determine the number of present files 
        if not os.path.exists(folder_path):
            missing.append((label, -1))
        else:
            missing.append((label, num_files))

    return missing

# IMPORTANT: set the folder path, where the processed data is stored
folder_path = "C:\\Users\\Kontor\\Github Repos\\Brain-Tumour-Analysis\\processed"

# Compare the processed data with the diagnosis labels
missing_labels = compare(folder_path)

matching_labels = []
non_matching_labels = []
total_missing = 0

# Classify labels as matching or non-matching based on the number of files
# This makes it easy to see which labels are still missing
for label, num_files in missing_labels:
    if label != 'nan':
        if num_files == diagnosis_counts[label]:
            matching_labels.append((label, num_files))
        else:
            # A missing folder is reported as -1 files, so clamp to 0 before subtracting
            total_missing += diagnosis_counts[label] - max(num_files, 0)
            non_matching_labels.append((label, num_files))

matching_labels = sorted(matching_labels, key=lambda x: x[1], reverse=True)
non_matching_labels = sorted(non_matching_labels, key=lambda x: x[1], reverse=True)

#print("Matching labels:")
#for label, num_files in matching_labels:
#    print(f"{label}: {num_files}")
#print()
print('Total missing:', total_missing)
print("Non-matching labels:")
for label, num_files in non_matching_labels:
    print(f"{label}: {num_files}", diagnosis_counts[label])

Total missing: 1
Non-matching labels:
Angiomatous meningioma: 31 32
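
The same check can be written as a read-only function that returns how many files each label still lacks. This is a minimal sketch under stated assumptions: `count_missing`, the label names, and the expected counts are hypothetical, and unlike `compare` it never deletes folders.

```python
import tempfile
from pathlib import Path
from typing import Dict


def count_missing(processed_dir: Path, expected: Dict[str, int]) -> Dict[str, int]:
    """Return {label: number of PNGs still missing} for labels that are incomplete."""
    missing = {}
    for label, expected_count in expected.items():
        label_dir = processed_dir / label
        # A folder that does not exist simply contributes zero files
        present = len(list(label_dir.glob('*.png'))) if label_dir.is_dir() else 0
        if present < expected_count:
            missing[label] = expected_count - present
    return missing


# Hypothetical labels and counts in a throwaway directory, for illustration only
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'Label A').mkdir()
    (root / 'Label A' / 'slide1.png').touch()
    (root / 'Label A' / 'slide2.png').touch()
    result = count_missing(root, {'Label A': 3, 'Label B': 2})

print(result)  # {'Label A': 1, 'Label B': 2}
```

In the notebook, `expected` could be built from `diagnosis_counts.to_dict()`.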


## Download stream

The full dataset is 3948.2 GB, which is too large to download and keep on disk in one go. Instead, we connect to the EBRAINS data proxy API and download each file individually via script, showing a progress bar for every download.

In [7]:
# Define the URL and header for the API request
url = 'https://data-proxy.ebrains.eu/api/v1/datasets/8fc108ab-e2b4-406f-8999-60269dc1f994?limit=5000'

# Replace the placeholder with your own EBRAINS bearer token
token = "<YOUR_EBRAINS_TOKEN>"
header = f"Bearer {token}"

# Function to send an API request
def request(url, header):
	# Send the request, including the authorization token
	# You can get it by inspecting the network traffic in your browser on the EBRAINS Data-Proxy website
	response = requests.get(url, headers={'Authorization': header})
	if response.status_code == 200:
		return response.json()
	else:
		print(response.json())
		print(f'Request failed with status code {response.status_code}')
		return None

# Function to download a file with progress display	
def download(url, filename, header):
	
	with requests.get(url, headers={'Authorization': header}, stream=True) as r:
		total_size = int(r.headers.get('content-length', 0))
		block_size = 1024 # 1 Kibibyte

		# Initialize the progress bar
		progress_bar = tqdm(total=total_size, unit="iB", unit_scale=True)

		# Write the downloaded data to the file while displaying the progress bar
		with open(filename, "wb") as f:
			for chunk in r.iter_content(block_size):
				progress_bar.update(len(chunk))
				f.write(chunk)
		
		progress_bar.close()
		r.close()
		
	# Verify if the download was successful
	if total_size != 0 and progress_bar.n != total_size:
		print("ERROR, something went wrong")
		return False
	else:
		return True

# Send request that returns information about the dataset
data = request(url, header)
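
One caveat of writing directly to the final file name is that an interrupted download leaves a partial `.ndpi` file behind. A possible way around this, sketched here as an assumption rather than as part of the notebook, is to write to a temporary `.part` file and rename it only after the size check passes. `write_chunks` is a hypothetical helper; in the notebook its `chunks` argument would be `r.iter_content(block_size)` and `expected_size` would come from the `content-length` header.

```python
import os
import tempfile
from typing import Iterable, Optional


def write_chunks(chunks: Iterable[bytes], filename: str,
                 expected_size: Optional[int] = None) -> bool:
    """Write chunks to filename via a temporary '.part' file; return True on success."""
    part_name = filename + '.part'
    written = 0
    with open(part_name, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
    # Treat a size mismatch as a failed download and discard the partial file
    if expected_size is not None and written != expected_size:
        os.remove(part_name)
        return False
    # os.replace overwrites an existing target, so reruns are safe
    os.replace(part_name, filename)
    return True


# Fake chunks stand in for a network stream, for illustration only
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'slide.ndpi')
    other = os.path.join(tmp, 'other.ndpi')
    ok = write_chunks([b'abc', b'de'], target, expected_size=5)
    ok_exists = os.path.exists(target)
    bad = write_chunks([b'abc'], other, expected_size=5)
    bad_exists = os.path.exists(other)

print(ok, ok_exists, bad, bad_exists)  # True True False False
```

With this pattern, a file that exists under its final name is always complete.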

## Automated download

This code iterates over the objects in the data retrieved from the EBRAINS data proxy API. For each object, it extracts the label and file name, creates the necessary folder and file paths, and downloads the NDPI file if the corresponding PNG file doesn't exist yet. It then converts the downloaded file to PNG format and removes the NDPI file. The code keeps track of the total number of files downloaded and prints the count at the end.

In [20]:
total_downloads = 0

# The base url we will use to download the files
api_base_url = 'https://data-proxy.ebrains.eu/api/v1/datasets/8fc108ab-e2b4-406f-8999-60269dc1f994/'

# Iterate over each "object" in the dataset, which are the files
for obj in data['objects']:
	name = obj['name'].replace('v1.0/', '')
	name = name.replace('Embryonal tumour with multilayered rosette, C19MC-altered', 'Embryonal tumour with multilayered rosettes, C19MC-altered')
	
	if name == 'annotation.csv':
		continue
	
	# Get the label of the current object (e.g. "Embryonal carcinoma")
	# and the file (e.g. "a1980534-357f-11eb-a65a-001a7dda7111.ndpi")
	label, file = name.split('/')

	# Create a folder path for the current label if it doesn't exist yet
	data_folder_path = os.path.join(folder_path, label)
	if not os.path.exists(data_folder_path):
		os.makedirs(data_folder_path)

	file_path = os.path.join(data_folder_path, file)
	png_path = file_path.replace('.ndpi', '.png')

	# Skip the current object if the label is not present in diagnosis_counts
	if label not in diagnosis_counts and 'Control' not in label:
		continue

	# Check if the PNG file already exists, otherwise start the download
	if not os.path.exists(png_path):
		n_files = len(glob.glob(os.path.join(data_folder_path, '*.png')))
		if 'Control' in label:
			print(f'{n_files+1}/47: Starting download of', name)
		else:
			print(f'{n_files+1}/{diagnosis_counts[label]}: Starting download of', name)
		print(os.path.exists(file_path))

		# If the file already exists, it means the conversion to PNG failed, so we skip
		if not os.path.exists(file_path):
			# Try to download. If it succeeds, we try to convert to PNG.
			if download(api_base_url + 'v1.0/' + name, file_path, header):
				total_downloads += 1
				try:
					save_image(data_folder_path, file)
					os.remove(file_path)
				except Exception:
					print('Failed to convert', file_path)

print('Finished downloading', total_downloads, 'files')

32/32: Starting download of Angiomatous meningioma/a1982bd3-357f-11eb-aec7-001a7dda7111.ndpi
False


  0%|          | 0.00/2.10G [00:00<?, ?iB/s]

Failed to convert C:\Users\Kontor\Github Repos\Brain-Tumour-Analysis\processed\Angiomatous meningioma\a1982bd3-357f-11eb-aec7-001a7dda7111.ndpi
Finished downloading 1 files
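
The loop above unpacks `name.split('/')` into exactly two parts, which raises a `ValueError` for any object name with a different number of slashes. A more defensive variant is sketched below. `split_object_name` is a hypothetical helper, and the assumption that every slide entry has the form `v1.0/<label>/<file>.ndpi` is taken from the names seen in this notebook.

```python
from typing import Optional, Tuple


def split_object_name(name: str, prefix: str = 'v1.0/') -> Optional[Tuple[str, str]]:
    """Split 'v1.0/<label>/<file>.ndpi' into (label, file); return None for other entries."""
    if name.startswith(prefix):
        name = name[len(prefix):]
    # rsplit with maxsplit=1 splits only at the last '/'
    parts = name.rsplit('/', 1)
    if len(parts) != 2 or not parts[1].endswith('.ndpi'):
        return None
    return parts[0], parts[1]


slide = split_object_name('v1.0/Angiomatous meningioma/a1982bd3-357f-11eb-aec7-001a7dda7111.ndpi')
other = split_object_name('v1.0/annotation.csv')
print(slide)  # ('Angiomatous meningioma', 'a1982bd3-357f-11eb-aec7-001a7dda7111.ndpi')
print(other)  # None
```

Entries that return `None`, such as `annotation.csv`, can then be skipped with a single check.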


## Resize

The following script brings all images to a uniform size by iterating over the files in the `processed` folder, loading each image, and resizing it to the target size. It then saves the resized image in the `resized` folder. We choose sizes of 256x256, 512x512, 1024x1024, and 2048x2048 pixels, as these are common sizes used in image processing and machine learning applications.

In [24]:
import os
import shutil
from PIL import Image

def resize_images(folder_path, target_size, output_folder):
    # Create the output folder based on the target size
    output_folder = os.path.join(output_folder, f"{target_size[0]}x{target_size[1]}")
    os.makedirs(output_folder, exist_ok=True)

    # Traverse through the data folder and its subfolders
    for root, subfolders, files in os.walk(folder_path):
        for subfolder in subfolders:
            subfolder_path = os.path.join(root, subfolder)
            output_subfolder_path = os.path.join(output_folder, os.path.relpath(subfolder_path, folder_path))

            # Copy the 'metadata' subfolder as-is without resizing
            if 'metadata' in subfolder:
                if not os.path.exists(output_subfolder_path):
                    shutil.copytree(subfolder_path, output_subfolder_path)
                continue

            # Create the output subfolder if it doesn't exist
            if not os.path.exists(output_subfolder_path):
                os.makedirs(output_subfolder_path)
            
            # Resize each PNG image in the subfolder
            for file in os.listdir(subfolder_path):
                if file.endswith(".png"):
                    file_path = os.path.join(subfolder_path, file)
                    output_file_path = os.path.join(output_subfolder_path, file)
                    try:
                        # Open the image, resize it, and save the resized image
                        with Image.open(file_path) as img:
                            resized_img = img.resize(target_size)
                            resized_img.save(output_file_path)
                    except Exception:
                        # Print an error message if resizing fails
                        print(f'Error: Could not resize {file_path}')

folder_path = "C:\\Users\\Kontor\\Github Repos\\Brain-Tumour-Analysis\\processed"
output_path = "C:\\Users\\Kontor\\Github Repos\\Brain-Tumour-Analysis\\resized"

resize_images(folder_path, (256, 256), output_path)
resize_images(folder_path, (512, 512), output_path)
resize_images(folder_path, (1024, 1024), output_path)
resize_images(folder_path, (2048, 2048), output_path)
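
Note that `img.resize(target_size)` stretches every image to the square target regardless of its original aspect ratio. If undistorted tissue shapes matter for a later step, one alternative is to scale the image so that it fits inside the target and pad the remainder. The sketch below only covers the size computation. `fit_within` is a hypothetical helper and not part of the notebook.

```python
from typing import Tuple


def fit_within(size: Tuple[int, int], target: Tuple[int, int]) -> Tuple[int, int]:
    """Scale size to fit inside target while keeping the aspect ratio."""
    width, height = size
    scale = min(target[0] / width, target[1] / height)
    # Round to whole pixels and never drop below one pixel per side
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within((4096, 3072), (512, 512)))    # (512, 384)
print(fit_within((3000, 6000), (1024, 1024)))  # (512, 1024)
```

The returned size could be passed to `img.resize` in place of `target_size`, with the result pasted onto a blank square canvas if a fixed input shape is required.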