## MACHINE LEARNING TO AID FACILE BREAST CANCER DIAGNOSIS

### Motivation for Project and Introduction:

One in 8 US women (~12.4%) will develop an invasive form of breast cancer over the course of her lifetime. In 2018, 
more than 200,000 new cases of invasive breast cancer are predicted, along with over 60,000 new cases of non-invasive 
breast cancer. [1]
Are men immune to breast cancer? The answer is no. According to breastcancer.org, ~2,500 new cases of invasive breast
cancer will be diagnosed in men in 2018. This is equivalent to a risk of about 1 in 1,000 for men.
Besides skin cancer, breast cancer is the most commonly diagnosed cancer. African-American women are reported to be at 
greater risk, and sex and age both contribute to the risk.

There are multiple approaches currently available for evaluating breast tissue for cancer cells. The most widely used, mammography, exposes breast tissue to low-dose ionizing X-ray radiation. After a breast exam, the expertise of a radiologist is required to convert the mammogram image into an actionable outcome: cancer, no cancer, or needs further investigation. Our goal in this project is to demonstrate that even a moderately well-trained convolutional neural network (CNN) can detect cancer and other malignant tissue at rates comparable to those of well-trained radiologists. [2][3][4]


![Breast_Cancer_Incidence_Rate](Breast-Cancer-Incidence-Worldwide.jpg "Breast Cancer")


### Python Web Scraping from USF for Downloading Lossless JPEG Files

The mammogram data files used for this ML project were downloaded from the University of South Florida digital 
mammography homepage (http://marathon.csee.usf.edu/Mammography/Database.html). 
The database consists of 2600*4 lossless JPEG files and their labels. Each subject's set of mammograms
consists of four views: RIGHT_CC (right craniocaudal), LEFT_CC (left craniocaudal), RIGHT_MLO (right mediolateral oblique), and LEFT_MLO (left mediolateral oblique). 

Unlike other ML projects where one already has a pre-processed dataset, the mammogram data files used for this project
were in their raw format, so over 3 weeks of hard work went into feature engineering.

The thought process for feature engineering was broken down into several pieces as highlighted below:
    
    Part (I): Label extraction from the USF web page with a Python HTML web-crawling algorithm.
    Part (II): Feature (lossless JPEG) extraction from the USF FTP site using a Python FTP web-crawling algorithm.
    Part (III): Feature (lossless JPEG) conversion from lossless JPEG to PNG for machine learning. (A quite
                difficult, messy, and very slow step.) Several tools were used for this
                conversion step, including 3rd-party programs, command-line bash scripting, etc.
    Part (IV): Final matching of labels and converted features for machine learning. 

After the mammogram images were converted to PNG files, we ran into an issue: the image files all had 
different pixel widths and pixel heights. A convolutional neural network takes a fixed input shape of pixel height, pixel width, and color channels (i.e. a 3-dimensional array), so all image files supplied to the network must have the same dimensions. 
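
To make the fixed-shape requirement concrete, here is a minimal sketch in plain Python (no notebook dependencies). It uses nearest-neighbour sampling on tiny nested lists, which is a deliberately simpler technique than the cubic interpolation that `cv2.resize` applies later in this notebook. The function names and the toy "images" are hypothetical and exist only for illustration.

```python
def resize_nearest(image, new_h, new_w):
    """Resize a 2-D grid (list of rows) using nearest-neighbour sampling."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def shape(image):
    """Return (height, width) of a 2-D grid."""
    return (len(image), len(image[0]))


# Two toy single-channel "images" with different dimensions.
tall = [[r * 10 + c for c in range(4)] for r in range(6)]  # 6 x 4
wide = [[r * 10 + c for c in range(8)] for r in range(3)]  # 3 x 8

# After resizing, every image in the batch has the same shape.
batch = [resize_nearest(img, 3, 3) for img in (tall, wide)]
shapes = {shape(img) for img in batch}
print(shapes)
```

Once every image has the same shape, the list can be stacked into a single array, which is what the network expects as input.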

Shown below is a raw LEFT_CC image after conversion from lossless JPEG to PNG. As is evident, the file has a very high resolution. Feeding images at this resolution into the network is impractical, however, because each one would become a gargantuan matrix/tensor that the computer would need to optimize over.
Instead, the approach used was to store the image files in their native converted .png format and write code that resizes them to the right dimensions just before machine learning.



![LEFT_CC IMAGE](B_3001_1.LEFT_CC.LJPEG.png "LEFT_CC_IMAGE")
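
To put the resolution problem in rough numbers, the sketch below estimates the memory taken by one image once it is stored as a `float32` array with three color channels. The raw size of 4000 x 3000 pixels is a hypothetical stand-in, since the actual dimensions vary from file to file. The 300 x 300 size is the input shape used by the networks later in this notebook.

```python
BYTES_PER_FLOAT32 = 4


def tensor_bytes(num_images, height, width, channels=3):
    """Bytes needed to hold the images as a dense float32 array."""
    return num_images * height * width * channels * BYTES_PER_FLOAT32


# Hypothetical raw resolution; real DDSM files differ from image to image.
raw_bytes = tensor_bytes(1, 4000, 3000)
# Resolution fed to the CNN in this notebook.
small_bytes = tensor_bytes(1, 300, 300)

print("raw image:     %.1f MB" % (raw_bytes / 1e6))
print("resized image: %.2f MB" % (small_bytes / 1e6))
print("reduction factor: %.0fx" % (raw_bytes / small_bytes))
```

Under these assumptions, resizing shrinks each image by roughly two orders of magnitude, which is what makes it feasible to hold the training set in memory.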

### Image Transformation and Image randomization prior to CNN Training

##### Importing all the relevant packages for Image transformation and for Machine Learning on a CNN

In [3]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import pandas as pd
import os
import time
from PIL import Image
import sys
import tensorflow as tf
import cv2
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.constraints import maxnorm
from keras.optimizers import SGD
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.utils import np_utils
from keras import backend as K
# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
print("Done Uploading all Packages")

Done Uploading all Packages


##### Reading in all the feature files scraped from the DDSM web page

In [4]:
ken1= pd.read_csv('C:\\Users\\tralabi\\Downloads\\LCC.csv')
ken2= pd.read_csv('C:\\Users\\tralabi\\Downloads\\RCC.csv')
ken3= pd.read_csv('C:\\Users\\tralabi\\Downloads\\LMLO.csv')
ken4= pd.read_csv('C:\\Users\\tralabi\\Downloads\\RMLO.csv')

frames = [ken1, ken2, ken3, ken4]
FullFrame = pd.concat(frames)

print(len(FullFrame))

9392


#### Reading all native image files (.png) into memory and transforming them to the right shape for CNN ML.
#### Reading the corresponding labels (tags) into memory

In [None]:
# LEFT CRANIOCAUDAL
X1L = []  # labels
X3L = []  # images
# Target size; must match the CNN input_shape of (300, 300, 3)
WIDTH = 300
HEIGHT = 300
path = "/Users/taiwoalabi/Downloads/benign_01/case3094"
os.chdir(path)
# All converted LEFT_CC images in the current directory
YY2 = [doc for doc in os.listdir() if doc.endswith(".LEFT_CC.LJPEG.png")]
for ii in range(len(FullFrame)):
    for tete in YY2:
        parts = tete.split('.')
        tete2 = parts[0] + '.' + parts[1]   # e.g. B_3001_1.LEFT_CC
        tete1 = tete2 + '.' + "LJPEG"
        # Column 3 holds the file key, column 1 the label
        if tete2 == FullFrame.iat[ii,3]:
            X1L.append(FullFrame.iat[ii,1])
            ImageName = tete1 + '.png'
            full_size_image = cv2.imread(ImageName)
            X3L.append(cv2.resize(full_size_image, (WIDTH,HEIGHT), interpolation=cv2.INTER_CUBIC))
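
The nested loop above compares every label row against every file, which gets slow as the dataset grows. A possible alternative, sketched here in plain Python, builds a dictionary from file key to label once and then looks each file up directly. The helper names, the `(key, label)` row layout, and the sample rows are hypothetical; only the file name `B_3001_1.LEFT_CC.LJPEG.png` comes from this project.

```python
def view_key(filename):
    """Return '<case>.<view>' from a name like 'B_3001_1.LEFT_CC.LJPEG.png'."""
    parts = filename.split(".")
    return parts[0] + "." + parts[1]


def match_labels(filenames, label_rows):
    """Pair each file with its label using a single dictionary lookup.

    label_rows is an iterable of (key, label) tuples (hypothetical layout).
    Files without a matching key are left out.
    """
    label_by_key = dict(label_rows)
    matched = []
    for name in filenames:
        key = view_key(name)
        if key in label_by_key:
            matched.append((name, label_by_key[key]))
    return matched


# Made-up rows for illustration; the labels are not real DDSM annotations.
files = ["B_3001_1.LEFT_CC.LJPEG.png", "B_3002_1.LEFT_CC.LJPEG.png"]
rows = [("B_3001_1.LEFT_CC", "No_Cancer"), ("C_0001_1.LEFT_CC", "Cancer")]
pairs = match_labels(files, rows)
print(pairs)
```

One behavioral difference to keep in mind: if the same key appears in several label rows, the dictionary keeps only the last one, whereas the nested loop appends the image once per matching row.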

#### Converting the list of image arrays into a single normalized array; encoding the labels

In [None]:
X4L = np.array(X3L)
X4L = X4L.astype('float32')
X7L = X4L/255

# Encoding the labels: 'Cancer' -> 1, 'No_Cancer' -> 0
X5L = []
for ii in range(len(X1L)):
    if X1L[ii] == 'Cancer':
        X5L.append(1)
    elif X1L[ii] == 'No_Cancer':
        X5L.append(0)

X6L = np.array(X5L)
y = np_utils.to_categorical(X6L)

num_classes = y.shape[1]


print("Transformation is done")

# Randomization and splitting of the dataset into a training dataset and a testing dataset
Xtrain, Xtest, ytrain, ytest = train_test_split( X7L,y, test_size = 0.15, random_state =0)

print("The shape of Xtrain[0] is: ", Xtrain[0].shape)
print("The shape of ytrain[0] is: ", ytrain[0].shape)
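
One thing to watch in the label loop above: any label other than `'Cancer'` or `'No_Cancer'` is skipped silently, which would leave fewer labels than images and misalign the two arrays. The plain-Python sketch below shows one way to surface such rows, together with a hand-written equivalent of one-hot encoding. The function names and sample labels are hypothetical and are not part of the notebook.

```python
LABEL_TO_INDEX = {"No_Cancer": 0, "Cancer": 1}


def encode_labels(labels):
    """Map label strings to class indices and report unrecognized positions."""
    indices, skipped = [], []
    for pos, label in enumerate(labels):
        if label in LABEL_TO_INDEX:
            indices.append(LABEL_TO_INDEX[label])
        else:
            skipped.append(pos)
    return indices, skipped


def one_hot(indices, num_classes):
    """Turn class indices into one-hot rows, e.g. 1 -> [0.0, 1.0]."""
    return [[1.0 if k == i else 0.0 for k in range(num_classes)] for i in indices]


# Made-up labels; 'Unknown' stands in for any unexpected value.
indices, skipped = encode_labels(["Cancer", "No_Cancer", "Unknown"])
encoded = one_hot(indices, num_classes=2)
print(encoded, "skipped positions:", skipped)
```

If `skipped` is not empty, the images at those positions should be dropped as well before the train/test split, so that features and labels stay aligned.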

#### Generating the Convolutional Neural Network with Keras and TensorFlow

In [None]:
# 2 convolutional layers
# 1 fully connected (dense) hidden layer with 512 nodes
# Batch size of 32
# Input shape 300x300x3
# Loss function: binary cross-entropy
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(300, 300, 3), padding='same', activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(Conv2D(32, (3, 3), activation='relu', padding='same', kernel_constraint=maxnorm(3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512, activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.5))
model.add(Dense(2, activation='sigmoid'))
# Compile model
epochs = 50
lrate = 0.01
decay = lrate/epochs
sgd = SGD(lr=lrate, momentum=0.9, decay=decay, nesterov=False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
print(model.summary())
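
As a sanity check on what `model.summary()` should report for the two-convolution network above, the trainable parameters can be counted by hand. The sketch below is plain Python with hypothetical helper names. It assumes the standard counting rules: a convolution has one weight per kernel element per input channel per filter plus one bias per filter, a dense layer has one weight per input-output pair plus one bias per unit, and dropout, pooling, and flatten layers add none.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus biases of a 2-D convolution layer."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


conv1 = conv2d_params(3, 3, 3, 32)    # RGB input -> 32 filters
conv2 = conv2d_params(3, 3, 32, 32)   # 32 -> 32 filters

# 'same' padding keeps 300x300; 2x2 max pooling halves it to 150x150.
flat = (300 // 2) * (300 // 2) * 32

dense1 = dense_params(flat, 512)
output = dense_params(512, 2)

total_params = conv1 + conv2 + dense1 + output
print("flattened features:", flat)
print("total parameters:  ", total_params)
```

Almost all of the parameters sit in the first dense layer, because the flattened feature map is so large. Adding more pooling before `Flatten()` would be one way to shrink it.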

In [None]:
# 5 convolutional layers
# 2 fully connected (dense) hidden layers
# Batch size of 32
# Input image shape 300x300x3
# Loss function: binary cross-entropy


model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(300, 300, 3), padding='same', activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(Dropout(0.2))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Dropout(0.2))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(1024, activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu', kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(Dense(2, activation='sigmoid'))
# Compilation of Model 
epochs = 10
lrate = 0.01
decay = lrate/epochs
sgd = SGD(lr=lrate, momentum=0.9, decay=decay, nesterov=False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
print(model.summary())
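
The `decay = lrate/epochs` setting is easier to reason about with numbers. To my understanding, the legacy Keras `SGD` optimizer applies time-based decay once per batch update, not once per epoch, as `lr_t = lr / (1 + decay * t)` with `t` the number of updates so far. The plain-Python sketch below evaluates that formula under this assumption. The value of `steps_per_epoch` is hypothetical, since it depends on the training set size and the batch size.

```python
def decayed_lr(initial_lr, decay, iteration):
    """Time-based decay: lr / (1 + decay * iteration)."""
    return initial_lr / (1.0 + decay * iteration)


epochs = 10
lrate = 0.01
decay = lrate / epochs

# Hypothetical number of batch updates per epoch (training samples / batch size).
steps_per_epoch = 100

# Learning rate at the start of training and at the end of each epoch.
schedule = [decayed_lr(lrate, decay, e * steps_per_epoch) for e in range(epochs + 1)]
for epoch, lr in enumerate(schedule):
    print("after epoch %2d: lr = %.5f" % (epoch, lr))
```

With these numbers the learning rate halves over the 10 epochs. With more batches per epoch it falls faster, so the effective schedule depends on the dataset size as well as on `epochs`.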

In [None]:
model.fit(Xtrain, ytrain, validation_data=(Xtest, ytest), epochs=epochs, batch_size=32)
# Evaluation of Model -- Maybe Needs a graphical analysis also
scores = model.evaluate(Xtest, ytest, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
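
Accuracy alone can hide how a diagnostic classifier fails, since a missed cancer and a false alarm have very different consequences. As one possible form of the further analysis mentioned above, this plain-Python sketch computes sensitivity and specificity from true and predicted class indices. The helper names and the example labels are made up for illustration and are not results from this model.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Made-up class indices (1 = Cancer, 0 = No_Cancer).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = safe_ratio(tp, tp + fn)  # share of cancers detected
specificity = safe_ratio(tn, tn + fp)  # share of healthy cases cleared
print("sensitivity: %.2f, specificity: %.2f" % (sensitivity, specificity))
```

With the notebook's one-hot labels, the class indices could be obtained by taking the argmax over each row of `ytest` and of `model.predict(Xtest)`.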

### References

[1] Susan G. Komen [https://ww5.komen.org/BreastCancer/Statistics.html]

[2] Comparison of the Accuracy of Thermography and Mammography in the Detection of Breast Cancer - Aug 2016
    (Breast Care(Basel))
    
[3] A half-second glimpse often lets radiologists identify breast cancer cases even when viewing the mammogram
    of the opposite breast - April, 2018 (PNAS) [http://www.pnas.org/content/pnas/113/37/10292.full.pdf]
    
[4] AI algorithm uses color to better detect breast cancer - July 2016 (AuntMinnie.com) [https://www.auntminnie.com/index.aspx?sec=sup&sub=aic&pag=dis&ItemID=117752]


### Acknowledgments and Credits
[1] The Digital Database for Screening Mammography - 2001, Medical Physics Publishing

[2] Current Status of the Digital Database for Screening Mammography - 1998, Proceedings of the Fourth International
    Workshop on Digital Mammography.
