# Project 2
by:
- Rebecca Kuhlman
- Michael Amberg
- Sam Yao

## Business Understanding
Identifying the type of brain tumor a patient has is an important step in determining the patient's treatment plan. Brain tumors can be diagnosed via MRI imaging, which has led to interest in using machine learning to support diagnosis. A second opinion on brain tumor diagnoses would help improve patient care and outcomes and lessen stress on doctors. A machine learning model could also speed up analysis and flag which patients need urgent treatment.

This dataset contains MRI images of glioma, meningioma, and pituitary tumors, as well as MRI images with no tumor.
Glioma tumors are usually malignant, while meningioma and pituitary tumors are usually benign. Different types of tumors are made of different types of cells, and each type has a region of the brain where it most commonly occurs.
More information can be found at: https://www.mayoclinic.org/diseases-conditions/brain-tumor/symptoms-causes/syc-20350084

There are many other types of tumors that future algorithms will need to address. The majority of these other types are more common in children, while the images in this dataset are all of adult brains.

Because the model deals with health conditions that have extreme effects on the patient, model accuracy is extremely important. Furthermore, the model must be tuned to avoid fatal misdiagnoses. While incorrectly marking a patient with a benign tumor as malignant is wasteful, the adverse effects are minimal. Conversely, misdiagnosing a malignant tumor as benign may be fatal for the patient. Therefore, the designed model must minimize the rate of false negatives while achieving an accuracy of 95% or more.
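
The two targets in this paragraph, overall accuracy and the false-negative rate for malignant tumors, can be measured separately. The following is a minimal sketch in plain Python, assuming that class labels are the dataset's folder names and that `glioma_tumor` is treated as the malignant (positive) class. The function names and the example labels are hypothetical and only illustrate the calculation.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def false_negative_rate(y_true, y_pred, positive="glioma_tumor"):
    """Fraction of truly positive samples that were NOT predicted as positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        raise ValueError("y_true contains no samples of the positive class")
    missed = sum(1 for p in predictions_for_positives if p != positive)
    return missed / len(predictions_for_positives)


# Hypothetical labels, for illustration only (not results from this project)
y_true = ["glioma_tumor", "glioma_tumor", "meningioma_tumor", "no_tumor", "glioma_tumor"]
y_pred = ["glioma_tumor", "meningioma_tumor", "meningioma_tumor", "no_tumor", "glioma_tumor"]

print("accuracy:", accuracy(y_true, y_pred))
print("false negative rate:", false_negative_rate(y_true, y_pred))
```

In this toy example four of five predictions are correct, yet one of the three malignant cases is missed, which shows why accuracy alone does not capture the requirement.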

It should be noted that the majority of brain tumor misdiagnoses happen before a brain scan or related test is ordered.
https://paulandperkins.com/brain-tumors/


## Data Preparation

Sources that were helpful for this section include:
- [1] https://pillow.readthedocs.io/en/stable/handbook/tutorial.html
- [2] https://towardsdatascience.com/loading-custom-image-dataset-for-deep-learning-models-part-1-d64fa7aaeca6

In [1]:
import pandas as pd
import numpy as np
import os
from PIL import Image # Utilized Source [2]

img = Image.open("./Training/glioma_tumor/gg (1).jpg") # Utilized Source [1]
img_arr = np.array(img)
new_arr = list()
for x in img_arr:
    for y in x:
        new_arr.append(y)
print(len(new_arr))
# This method creates the data, whether training or testing, in the form we desire
# Uses code from source [2] to create the training datasets
def create_dataset(img_folder):
    # Read through all files in "./Training"
    img_data_array=[]
    class_name=[]
    for dir1 in os.listdir(img_folder):
        for file in os.listdir(os.path.join(img_folder, dir1)):
            image_path= os.path.join(img_folder, dir1,  file)
            image= np.array(Image.open(image_path))
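            # Note: np.resize only reshapes the raw values (repeating or truncating them if the
            # element count differs); it does not rescale the picture the way Image.resize does.
            image = np.resize(image, (1,262144,3)) #Vectorizes each image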
            image = image.astype('float32')
            #image /= 255  
            img_data_array.append(image)
            class_name.append(dir1)
    # return array with training data.
    return img_data_array, class_name

262144


In [2]:
df_training, training_classes = create_dataset("./Training")
df_testing, testing_classes = create_dataset("./Testing")

In [3]:
df_training[0].shape

(1, 262144, 3)

## Data Reduction

PCA

In [16]:
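# get some of the specifics of the dataset
# Each entry of df_training has shape (1, 262144, 3). Assumption: the three colour channels
# are averaged into one grayscale value per pixel so that each row holds h*w features,
# which is what the reshape((h, w)) calls in the later cells expect.
X = np.stack([np.asarray(im).mean(axis=-1).ravel() for im in df_training])
y = training_classes

n_samples, n_features = X.shape
h, w = img_arr.shape[:2]  # img_arr is (height, width, channels)
n_classes = 4

print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
print("n_classes: {}".format(n_classes))
print("Original Image Sizes {} by {}".format(h,w))

n_samples: 2870
n_features: 262144
n_classes: 4
Original Image Sizes 512 by 512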


In [None]:
# PCA on the flattened images: keep only the top n_components principal components
from sklearn.decomposition import PCA

n_components = 300
print ("Extracting the top %d principal components from %d images" % (
    n_components, X.shape[0]))

pca = PCA(n_components=n_components)
%time pca.fit(X.copy())
eigenfaces = pca.components_.reshape((n_components, h, w))

In [None]:
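def reconstruct_image(trans_obj,org_features):
    # project onto the fitted components, then map back to pixel space
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image

idx_to_reconstruct = 1
X_idx = np.asarray(X)[idx_to_reconstruct]  # select one image (row) by position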
low_dimensional_representation, reconstructed_image = reconstruct_image(pca,X_idx.reshape(1, -1))

Randomized principal components analysis. Visualize the explained variance of each component. Analyze how many dimensions are required to adequately represent your image data. Explain your analysis and conclusion.
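
One way to answer the "how many dimensions" question is to accumulate the per-component explained variance and find the first component count that reaches a chosen threshold. The sketch below is plain Python and only illustrative: `components_for_variance` is a hypothetical helper, the ratios are made-up numbers, and the threshold is an assumption. With a fitted scikit-learn model, the input could be `pca.explained_variance_ratio_`.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Return the smallest k such that the first k ratios sum to at least threshold.

    Returns None if the threshold is never reached with the given components.
    """
    for k, total in enumerate(accumulate(explained_variance_ratio), start=1):
        if total >= threshold:
            return k
    return None


# Hypothetical explained-variance ratios, for illustration only
ratios = [0.5, 0.25, 0.125, 0.0625]

print(components_for_variance(ratios, threshold=0.8))
print(components_for_variance(ratios, threshold=0.95))
```

Plotting the same running total against the component index gives the usual cumulative explained variance curve.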

In [None]:
print ("Extracting the top %d principal components from %d images" % (
    n_components, X.shape[0]))

rpca = PCA(n_components=n_components, svd_solver='randomized')
%time rpca.fit(X.copy())
eigenfaces = rpca.components_.reshape((n_components, h, w))

Compare the representation using PCA and Randomized PCA. The method you choose to compare dimensionality methods should quantitatively explain which method is better at representing the images with fewer components.  Do you prefer one method over another? Why?
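
A quantitative comparison could use the reconstruction error: transform each image, invert the transform, and measure how far the result is from the original. The sketch below is plain Python under stated assumptions. The helper names are hypothetical, and the vectors are tiny made-up examples standing in for flattened images and their reconstructions from two methods.

```python
def mean_squared_error(original, reconstructed):
    """Mean squared difference between two equal-length vectors."""
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def mean_reconstruction_error(originals, reconstructions):
    """Average per-image MSE over a set of images."""
    errors = [mean_squared_error(o, r) for o, r in zip(originals, reconstructions)]
    return sum(errors) / len(errors)


# Hypothetical data: two "images" of three pixels each, and two sets of reconstructions
originals = [[0.0, 2.0, 4.0], [1.0, 1.0, 1.0]]
recon_method_a = [[0.0, 2.0, 3.0], [1.0, 1.0, 1.0]]
recon_method_b = [[1.0, 2.0, 3.0], [1.0, 1.0, 2.0]]

err_a = mean_reconstruction_error(originals, recon_method_a)
err_b = mean_reconstruction_error(originals, recon_method_b)
print("method A:", err_a)
print("method B:", err_b)
```

At the same number of components, the method with the lower average error represents the images more faithfully. Fit time, as reported by `%time`, is a second axis of comparison.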

Feature extraction upon the images using DAISY. Try different parameters for your image data.

In [3]:
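from skimage.feature import daisy
import matplotlib.pyplot as plt

# daisy expects a 2D grayscale array, so convert the PIL image first
img_gray = np.array(img.convert("L"))

# lets first visualize what the daisy descriptor looks like
features, img_desc = daisy(img_gray,
                           step=20,
                           radius=20,
                           rings=2,
                           histograms=8,
                           orientations=8,
                           visualize=True)
plt.imshow(img_desc)
plt.grid(False)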

In [None]:
# now let's understand how to use it (daisy needs a 2D grayscale array, not the RGB PIL image)
features = daisy(np.array(img.convert("L")), step=20, radius=20, rings=2, histograms=8, orientations=4, visualize=False)
print(features.shape)
print(features.shape[0]*features.shape[1]*features.shape[2])

In [None]:
# create a function to take in the row of the matrix and return a new feature
def apply_daisy(row,shape):
    feat = daisy(row.reshape(shape), step=20, radius=20,
                 rings=2, histograms=8, orientations=4,
                 visualize=False)
    return feat.reshape((-1))

%time test_feature = apply_daisy(X[3],(h,w))
test_feature.shape

In [None]:
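import copy
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import pairwise_distances

# dist_matrix was not defined in any earlier cell. Assumption: it holds the pairwise
# distances between the DAISY features of every training image.
X_arr = np.asarray(X)
daisy_features = np.apply_along_axis(apply_daisy, 1, X_arr, (h,w))
dist_matrix = pairwise_distances(daisy_features)

# find closest image to current image
idx1 = 5
distances = copy.deepcopy(dist_matrix[idx1,:])
distances[idx1] = np.inf # dont pick the same image! (np.infty was removed in NumPy 2.0)
idx2 = np.argmin(distances)

plt.figure(figsize=(7,10))
plt.subplot(1,2,1)
plt.imshow(X_arr[idx1].reshape((h,w)), cmap="gray")
plt.title("Original Image")
plt.grid()

plt.subplot(1,2,2)
plt.imshow(X_arr[idx2].reshape((h,w)), cmap="gray")
plt.title("Closest Image")
plt.grid()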

Does this feature extraction method show promise for your prediction task? Why?
Use visualizations to analyze this question. For example, use a heat map of the pairwise differences (ordered by class) among all extracted features. Another option is to build a nearest neighbor classifier to see actual classification performance.
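
The nearest neighbor option can be sketched as a leave-one-out check: each image is classified by the label of its closest other image in feature space. The code below is a plain Python illustration, not the project's implementation. The helper names, the two-dimensional feature vectors, and the `no_tumor` label are hypothetical stand-ins for the real DAISY features and class folders.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leave_one_out_1nn_accuracy(features, labels):
    """Classify each sample by its nearest other sample and report the accuracy."""
    correct = 0
    for i, query in enumerate(features):
        # index of the closest sample, excluding the query itself
        nearest = min(
            (j for j in range(len(features)) if j != i),
            key=lambda j: euclidean(query, features[j]),
        )
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical 2-D features and labels, for illustration only
features = [[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [1.1, 1.0], [0.2, 0.2]]
labels = ["no_tumor", "no_tumor", "glioma_tumor", "glioma_tumor", "glioma_tumor"]

print("leave-one-out 1-NN accuracy:", leave_one_out_1nn_accuracy(features, labels))
```

If the extracted features separate the classes well, this accuracy should be clearly above the rate obtained by always guessing the most common class.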

## Exceptional Work 😡

Additional feature extraction techniques (Gabor filters, keypoint matching, ordered gradients). Several are provided in the notebooks, and you might research techniques known in the computer vision literature.
Does this feature extraction method show promise for your prediction task? Why?
Use visualizations to analyze this question. For example, use a heat map of the pairwise differences (ordered by class) among all extracted features. Another option is to build a nearest neighbor classifier to see actual classification performance.