<a href="https://colab.research.google.com/github/rydeveraumn/csci-5561-flying-dolphins/blob/main/Breast_Cancer_Preprocessing_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Steps for the Pipeline

1. (If running in Colab) Mount Google Drive to access the zipped data file.
2. Unzip the file and save it in the local runtime directory. (Unzipping the folder in Drive causes issues.)
3. Repeat all the following steps for each image in the dataset:

3-1. Apply the rolling ball algorithm to the image. Currently, a ball size of 5 appears to work best. Normalize the output image.

3-2. Apply Huang's fuzzy thresholding algorithm to get the background of the image.

3-3. Use morphological transforms to denoise the background image. The sequence that I found works best is 1 binary erosion (2 x 4) followed by 6 binary dilations (2 x 3).

3-4. Perform a bitwise AND between the output of the rolling ball algorithm and the denoised background.

3-5. Rectify the images so that they are all right-oriented.

3-6. Remove the pectoral muscle using Canny edge detection and the Hough line transform.

3-7. Crop the image to include only the breast region of interest.

3-8. Resize the image to a uniform size.

3-9. Save the processed image in an output folder on the local drive.

4. Zip the output folder with the processed images.

5. Save the zipped output folder back to Google Drive.


##Rolling Ball Algorithm:

In [None]:
#Install dependencies for Rolling Ball
!pip install opencv-rolling-ball
import numpy as np
from PIL import Image
from cv2_rolling_ball import subtract_background_rolling_ball
from google.colab.patches import cv2_imshow
import cv2

#Performs rolling ball background subtraction on an image for a given ball size
#and normalizes the output image to the range [0, 255]
#Recommended ball_size = 5
def rollingBall(img, ball_size):
  #subtract_background_rolling_ball returns the corrected image and the estimated background
  final_img, _ = subtract_background_rolling_ball(img, ball_size, light_background=False, use_paraboloid=False, do_presmooth=False)
  image_output = cv2.normalize(final_img, None, 0, 255, cv2.NORM_MINMAX, dtype=cv2.CV_32F)
  return image_output
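
The normalization at the end of `rollingBall` is a min-max rescale. As a rough illustration of what `cv2.NORM_MINMAX` with bounds 0 and 255 computes, here is a NumPy-only sketch. `minmax_normalize` is a hypothetical helper name, not part of the notebook or of OpenCV, and the way it handles a constant image is an assumption of this sketch.

```python
import numpy as np

def minmax_normalize(img, lo=0.0, hi=255.0):
    """Linearly rescale img so its minimum maps to lo and its maximum to hi."""
    img = np.asarray(img, dtype=np.float32)
    img_min, img_max = float(img.min()), float(img.max())
    if img_max == img_min:
        # Assumption of this sketch: a constant image maps to lo everywhere
        return np.full_like(img, lo)
    scale = (hi - lo) / (img_max - img_min)
    return (img - img_min) * scale + lo

# A tiny synthetic "image" with values between 10 and 50
demo = np.array([[10, 20], [30, 50]], dtype=np.uint8)
normalized = minmax_normalize(demo)
```

The darkest pixel ends up at 0 and the brightest at 255, with everything else spaced linearly in between.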



##Huang's Fuzzy Thresholding Algorithm:


In [None]:
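#Calculates the threshold needed for a binary image
import math
import numpy

#Main function to find the threshold
#Takes in a 256-bin histogram and returns the threshold that minimizes the fuzzy entropy
def Huang(data):
    #Find the first non-empty bin
    first_bin = 0
    for ih in range(254):
        if data[ih] != 0:
            first_bin = ih
            break
    #Find the last non-empty bin
    last_bin = 254
    for ih in range(254, -1, -1):
        if data[ih] != 0:
            last_bin = ih
            break
    #A histogram with a single occupied bin has nothing to separate
    if last_bin == first_bin:
        return first_bin
    term = 1.0 / (last_bin - first_bin)
    #mu_0[it]: mean gray level of the pixels at or below it
    mu_0 = numpy.zeros(254)
    num_pix = 0.0
    sum_pix = 0.0
    for ih in range(first_bin, 254):
        sum_pix = sum_pix + (ih * data[ih])
        num_pix = num_pix + data[ih]
        if num_pix > 0:
            mu_0[ih] = sum_pix / num_pix
    #mu_1[it]: mean gray level of the pixels above it
    mu_1 = numpy.zeros(254)
    num_pix = 0.0
    sum_pix = 0.0
    for ih in range(last_bin, 0, -1):
        sum_pix = sum_pix + (ih * data[ih])
        num_pix = num_pix + data[ih]
        if num_pix > 0:
            mu_1[ih - 1] = sum_pix / num_pix
    #Search for the threshold with the smallest total fuzzy entropy
    threshold = -1
    min_ent = float("inf")
    for it in range(254):
        ent = 0.0
        #Entropy of the pixels at or below the candidate threshold
        for ih in range(it + 1):
            mu_x = 1.0 / (1.0 + term * math.fabs(ih - mu_0[it]))
            if not ((mu_x < 1e-06) or (mu_x > 0.999999)):
                ent = ent + data[ih] * (-mu_x * math.log(mu_x) - (1.0 - mu_x) * math.log(1.0 - mu_x))
        #Entropy of the pixels above the candidate threshold
        for ih in range(it + 1, 254):
            mu_x = 1.0 / (1.0 + term * math.fabs(ih - mu_1[it]))
            if not ((mu_x < 1e-06) or (mu_x > 0.999999)):
                ent = ent + data[ih] * (-mu_x * math.log(mu_x) - (1.0 - mu_x) * math.log(1.0 - mu_x))
        if ent < min_ent:
            min_ent = ent
            threshold = it
    return threshold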
  
#Takes in an image and returns its binary version,
#using the threshold found by the Huang algorithm
#Pixels above the threshold become 255; all others become 0
def getBinaryImage(image):
  histogram, bin_edges = numpy.histogram(image, bins=range(257))
  threshold = Huang(histogram)
  binary_image = numpy.where(image > threshold, 255, 0)
  return binary_image
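
The quantity that `Huang` accumulates for each gray level is the Shannon entropy of a fuzzy membership value. The sketch below isolates that single term so it can be inspected on its own. It mirrors the expressions used inside the loop, but `membership` and `shannon_term` are hypothetical helper names introduced only for illustration.

```python
import math

def membership(gray_level, class_mean, term):
    """Membership of a gray level in its class: 1 at the class mean, smaller farther away."""
    return 1.0 / (1.0 + term * abs(gray_level - class_mean))

def shannon_term(mu_x):
    """Shannon entropy of one membership value; treated as 0 when mu_x is within 1e-6 of 0 or 1."""
    if mu_x < 1e-06 or mu_x > 0.999999:
        return 0.0
    return -mu_x * math.log(mu_x) - (1.0 - mu_x) * math.log(1.0 - mu_x)

# Example with a histogram whose occupied bins span 200 gray levels, so term = 1 / 200
term = 1.0 / 200
at_mean = shannon_term(membership(50, 50.0, term))     # pixel exactly at its class mean
far_away = shannon_term(membership(250, 50.0, term))   # pixel 200 levels away from the mean
```

A pixel at its class mean contributes no entropy, while a pixel with membership 0.5 contributes the maximum, ln 2. A threshold that keeps most pixels close to their class mean therefore yields a low total.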

##Morphological Transforms:


In [None]:
#Morphological Transformations for denoising the background image
from PIL import Image
from scipy.ndimage import binary_erosion, binary_dilation

#Takes in a binary background image and returns
#the denoised version
#Recommended values: erosion_size = (2, 4), erosion_iterations = 1, dilation_size = (2, 3), dilation_iterations = 6
def denoise(image, erosion_size, erosion_iterations, dilation_size, dilation_iterations):
  # Define the structuring element for the erosion and dilation operations
  structuring_element = numpy.ones(erosion_size, dtype=numpy.uint8)

  # Perform erosion followed by dilation to remove noise and fill gaps
  # This combination seems to do well in removing the background without noise
  eroded_array = binary_erosion(image, structuring_element)
  for i in range(erosion_iterations - 1):
    eroded_array = binary_erosion(eroded_array, structuring_element)

  structuring_element = numpy.ones(dilation_size, dtype=numpy.uint8)
  denoised_array = binary_dilation(eroded_array, structuring_element)
  for i in range(dilation_iterations - 1):
    denoised_array = binary_dilation(denoised_array, structuring_element)
  
  return denoised_array

##Full Background Removal:

In [None]:
#This function takes an image and uses all of the previous helper functions to remove its background
#Inputs are all of the controllable variables from the previous functions, for maximum flexibility
def removeBackground(img, ball_size, erosion_size, erosion_iterations, dilation_size, dilation_iterations):
  img_proc = rollingBall(img, ball_size)
  background_img = getBinaryImage(img_proc)
  background_denoised = denoise(background_img, erosion_size, erosion_iterations, dilation_size, dilation_iterations)
  background_denoised = background_denoised.astype(numpy.uint8)
  img_proc = numpy.array(img_proc)
  img_noback = numpy.multiply(background_denoised, img_proc)
  return img_noback
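
`removeBackground` applies the mask by multiplying the image with a 0/1 array, which has the same effect as the AND described in step 3-4. A small NumPy sketch on synthetic data shows the equivalence with an explicit selection. The arrays below are made up for illustration.

```python
import numpy as np

# Synthetic 3x3 "image" and a boolean mask that keeps only the right two columns
img = np.array([[5.0, 6.0, 7.0],
                [8.0, 9.0, 10.0],
                [11.0, 12.0, 13.0]], dtype=np.float32)
mask = np.array([[False, True, True],
                 [False, True, True],
                 [False, True, True]])

# Multiplying by the mask cast to 0/1 zeroes every pixel outside the mask
by_multiply = np.multiply(mask.astype(np.uint8), img)

# np.where expresses the same selection explicitly
by_where = np.where(mask, img, 0)
```

Both results keep the original intensities inside the mask and are zero elsewhere.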



##Pectoral Removal:

In [None]:
#Make sure dependencies are imported
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pylab as pylab
from skimage import io
from skimage import color
import cv2

#Orients image to right if it is left sided
def right_orient_mammogram(image):
    left_nonzero = cv2.countNonZero(image[:, 0:int(image.shape[1]/2)])
    right_nonzero = cv2.countNonZero(image[:, int(image.shape[1]/2):])
    
    if(left_nonzero < right_nonzero):
        image = cv2.flip(image, 1)

    return image

#Apply Canny Edge Detection
from skimage.feature import canny
from skimage.filters import sobel

def apply_canny(image):
    canny_img = canny(image, 6)
    return sobel(canny_img)

#Apply Hough Transform
from skimage.transform import hough_line, hough_line_peaks

def get_hough_lines(canny_img):
    h, theta, d = hough_line(canny_img)
    lines = list()
    #print('\nAll hough lines')
    for _, angle, dist in zip(*hough_line_peaks(h, theta, d)):
        #print("Angle: {:.2f}, Dist: {:.2f}".format(np.degrees(angle), dist))
        x1 = 0
        y1 = (dist - x1 * np.cos(angle)) / np.sin(angle + 0.000000001)
        x2 = canny_img.shape[1]
        y2 = (dist - x2 * np.cos(angle)) / np.sin(angle + 0.000000001)
        lines.append({
            'dist': dist,
            'angle': np.degrees(angle),
            'point1': [x1, y1],
            'point2': [x2, y2]
        })
    
    return lines

#Shortlisting lines
def shortlist_lines(lines):
    MIN_ANGLE = 10
    MAX_ANGLE = 70
    MIN_DIST  = 5
    MAX_DIST  = 200
    
    shortlisted_lines = [x for x in lines if
                          MIN_DIST <= x['dist'] <= MAX_DIST and
                          MIN_ANGLE <= x['angle'] <= MAX_ANGLE
                        ]
    #print('\nShortlisted lines')
    #for i in shortlisted_lines:
        #print("Angle: {:.2f}, Dist: {:.2f}".format(i['angle'], i['dist']))
        
    return shortlisted_lines

#Remove the Pectoral Region
from skimage.draw import polygon

#Returns the pixel coordinates of the triangle between the image corner and the pectoral line
#Raises IndexError if shortlisted_lines is empty
#Pass the image shape so that the triangle is clipped to the image bounds
def remove_pectoral_region(shortlisted_lines, shape=None):
    #The shortlisted line closest to the origin is taken as the pectoral boundary
    shortlisted_lines.sort(key = lambda x: x['dist'])
    pectoral_line = shortlisted_lines[0]
    d = pectoral_line['dist']
    theta = np.radians(pectoral_line['angle'])
    
    x_intercept = d/np.cos(theta)
    y_intercept = d/np.sin(theta)
    
    return polygon([0, 0, y_intercept], [0, x_intercept, 0], shape=shape)

#Only use this function; it uses all of the above to return the image with the pectoral muscle removed
def removePectoral(img):
  image = right_orient_mammogram(img)
  canny_image = apply_canny(image)
  lines = get_hough_lines(canny_image)
  shortlisted_lines = shortlist_lines(lines)
  rr, cc = remove_pectoral_region(shortlisted_lines, image.shape[:2])
  image[rr, cc] = 0
  return image
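
The Hough transform describes each line by its distance `d` from the origin and the angle `theta` of its normal, so points on the line satisfy `x*cos(theta) + y*sin(theta) = d`. Setting `y = 0` or `x = 0` gives the two intercepts used in `remove_pectoral_region`. As a sketch of the same geometry, the corner triangle can also be built as a boolean mask with NumPy alone. `triangle_mask` is a hypothetical helper, and its handling of pixels exactly on the boundary may differ slightly from `skimage.draw.polygon`.

```python
import numpy as np

def triangle_mask(shape, dist, angle_deg):
    """Boolean mask of the pixels between the top-left corner and the line
    x*cos(theta) + y*sin(theta) = dist (x = column index, y = row index)."""
    theta = np.radians(angle_deg)
    rows, cols = np.indices(shape)
    # Pixels on the origin side of the line have a projection smaller than dist
    return cols * np.cos(theta) + rows * np.sin(theta) < dist

# A line at 45 degrees, 5 pixels from the origin, crosses both axes at about 7.07
mask = triangle_mask((10, 10), dist=5.0, angle_deg=45.0)
x_intercept = 5.0 / np.cos(np.radians(45.0))
y_intercept = 5.0 / np.sin(np.radians(45.0))
```

Setting `image[mask] = 0` would then blank the same corner region, provided the angle lies strictly between 0 and 90 degrees as enforced by `shortlist_lines`.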



##Image Cropping and Resizing (Ryan add this part):
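
This step has not been filled in yet. Purely as a placeholder illustration, here is one way steps 3-7 and 3-8 could be sketched with NumPy: crop to the bounding box of the non-zero pixels, then resize with nearest-neighbour sampling. `crop_to_roi` and `resize_nearest` are hypothetical names, the target size is arbitrary, and the final notebook may well use a different method (for example an interpolating resize).

```python
import numpy as np

def crop_to_roi(image):
    """Crop to the bounding box of the non-zero pixels (assumes the background is already 0)."""
    rows = np.flatnonzero(image.any(axis=1))
    cols = np.flatnonzero(image.any(axis=0))
    if rows.size == 0:
        # Nothing but background: return the image unchanged
        return image
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def resize_nearest(image, out_shape):
    """Nearest-neighbour resize to out_shape = (height, width) using index lookup."""
    out_h, out_w = out_shape
    row_idx = np.arange(out_h) * image.shape[0] // out_h
    col_idx = np.arange(out_w) * image.shape[1] // out_w
    return image[row_idx[:, None], col_idx]

# Synthetic example: a 2x3 bright patch inside an 8x8 black image
demo = np.zeros((8, 8), dtype=np.uint8)
demo[2:4, 1:4] = 200
cropped = crop_to_roi(demo)
resized = resize_nearest(cropped, (4, 6))
```

The crop relies on the background having been zeroed by the earlier steps, so it would need to run after `removeBackground` and `removePectoral`.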


##Fully Preprocessing an Image
This function will take in an image and return the preprocessed version

In [None]:
#Preprocess Image
#Recommended variable values
#ball_size = 5
#erosion_size = (4,2)
#erosion_iterations = 1
#dilation_size = (3,2)
#dilation_iterations = 6
def preprocess_image(img, ball_size, erosion_size, erosion_iterations, dilation_size, dilation_iterations):
  img = removeBackground(img, ball_size, erosion_size, erosion_iterations, dilation_size, dilation_iterations)
  img = removePectoral(img)
  #INSERT CROPPING AND RESIZING FUNCTION HERE
  return img


##Mount Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

##Process Entire Dataset:

1. Unzip data folder from drive to local runtime folder
2. For each image in the folder, preprocess the image and save it to an output folder
3. Zip output folder and save to drive
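
The loop and the zipping step above can be sketched with the standard library alone. This is a minimal outline, not the notebook's final code: `process_dataset`, `zip_folder` and the `process_file` callback are hypothetical names, and the output folder is flat, so the sketch assumes file names are unique across the dataset.

```python
import shutil
from pathlib import Path

def process_dataset(input_dir, output_dir, process_file):
    """Call process_file(src, dst) for every PNG under input_dir and return the count."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(input_dir.rglob("*.png")):
        # dst keeps the original file name inside the output folder
        process_file(src, output_dir / src.name)
        count += 1
    return count

def zip_folder(folder, archive_base):
    """Zip folder into archive_base + '.zip' and return the path of the archive."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(folder))
```

In the notebook, `process_file` would be a small function that reads `src` as a grayscale image, passes it through `preprocess_image`, and writes the result to `dst`. The archive returned by `zip_folder` can then be copied into the mounted Drive folder.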

In [None]:
!unzip /content/drive/MyDrive/Breast\ Cancer\ Data/rsna-breast-cancer-256-pngs.zip -d rsna-256/

Example of how we would look up each image's view in train.csv before removing the pectoral muscle:

In [None]:
import pandas as pd

df = pd.read_csv('train.csv')
df = df.set_index('image_id')


#Background removal code here
image = right_orient_mammogram(image)

# Remove the pectoral muscle
# Parse the image_id from the image title

view_value = df.loc[this_image_id, 'view']
if view_value == "CC":
  try:
    image = removePectoral(image.copy())
  except IndexError:
    #No usable pectoral line was found, so the image is left unchanged
    pass
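
For the lookup to work, `this_image_id` has to be derived from each file name. The sketch below assumes the files are named `<patient_id>_<image_id>.png` and that `image_id` is an integer in `train.csv`. Both are assumptions about the dataset that should be checked against the actual files, and `parse_image_id` is a hypothetical helper.

```python
from pathlib import Path

def parse_image_id(file_name):
    """Return the integer image_id from a name like '<patient_id>_<image_id>.png'."""
    stem = Path(file_name).stem          # drop the directory and the extension
    return int(stem.rsplit("_", 1)[-1])  # keep the part after the last underscore

# Hypothetical file name used purely for illustration
this_image_id = parse_image_id("rsna-256/10006_462822612.png")
```

The returned value can be passed directly to `df.loc[...]` once `image_id` is set as the index.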

