# Assignment - shallow learning

Hi there! In this assignment, you will use shallow learning (including SVMs, random forests and gradient boosting, if you feel up for the challenge) to solve an adapted version of Question 1 from the winter 2023 exam in applied machine learning:

## Introduction:

During the semester you have become very excited about the field of digital pathology, an area that is developing rapidly due to advancements in microscopy imaging hardware. These advancements have made it possible to digitize glass slides into whole-slide images. You have recently read the paper by [Veeling et al. (2018)](https://arxiv.org/abs/1806.03962) and you are thrilled to see that the authors have derived a novel dataset, denoted PatchCamelyon (PCam), that will enable you to develop and benchmark your own machine learning models. Like Veeling et al. (2018), you are primarily interested in developing machine learning models that, based on patches of whole-slide images of lymph node sections, can assist pathologists in tumor detection.

The primary objective of this exam is to perform image classification using the PCam dataset. The full dataset consists of 327,680 color images (96x96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. For this assignment, however, you are only going to use the subset of the data that has been made available on Kaggle.

### Question 1 (adapted from the exam):
Use non-deep-learning methods to perform image classification (tumor detection). Consider, among other things, the following:
1. Support vector machines
2. Random forests
3. Boosting
4. A combination of two or all three of the methods
5. Assess the importance of image resolution for the methods you are using
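
As a starting point for item 4, the three model families can be combined by majority vote. The sketch below uses scikit-learn's `VotingClassifier` on a small synthetic dataset; the data, the variable names (`X_demo`, `y_demo`, `ensemble`) and all hyperparameters are illustrative assumptions, not the PCam data or tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for flattened image features (NOT the PCam data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Hard voting: each model casts one vote and the majority class wins.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf")),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_tr, y_tr)
y_va_hat = ensemble.predict(X_va)
ensemble_accuracy = accuracy_score(y_va, y_va_hat)
print(f"Validation accuracy of the voting ensemble: {ensemble_accuracy:.3f}")
```

On the real data, the same construction would take the preprocessed training features in place of `X_demo`. Whether the ensemble beats its best single member is something to check on a validation set.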

The assignment is posted as a Kaggle competition and is available here: https://www.kaggle.com/t/1f880200648443a3a30878d318cc6e4b


# Hints to get you started (with a very simple model)

In [1]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
import tqdm
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

Define a function that grayscales, resizes and flattens an image. This function might also come in handy (for deep learning later) if the original images are too large for your hardware configuration.

In [2]:
def convert_sample(image):
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image,[32,32]).numpy()
    image = image.reshape(1,-1)
    return image
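
To assess the importance of image resolution, it can help to make the target size a parameter. The sketch below is a NumPy-only variant with a hypothetical name, `convert_sample_np`. It assumes square RGB inputs whose side length is divisible by the target size, and it downsamples by averaging non-overlapping blocks, which is a different technique from the interpolation performed by `tf.image.resize` and will give slightly different pixel values.

```python
import numpy as np


def convert_sample_np(image, size=32):
    """Grayscale, block-average down to size x size, and flatten one RGB image."""
    image = np.asarray(image, dtype=np.float64)
    height, width, _ = image.shape
    if height % size or width % size:
        raise ValueError("size must divide the image height and width")
    # Common luminance weights for converting RGB to grayscale.
    gray = image @ np.array([0.2989, 0.5870, 0.1140])
    # Average non-overlapping blocks of shape (height/size, width/size).
    block_h, block_w = height // size, width // size
    small = gray.reshape(size, block_h, size, block_w).mean(axis=(1, 3))
    return small.reshape(1, -1)


# Random stand-in for one 96x96 RGB patch (NOT a real PCam image).
rng = np.random.default_rng(0)
fake_patch = rng.integers(0, 256, size=(96, 96, 3))
for size in (8, 16, 32, 48, 96):
    print(size, convert_sample_np(fake_patch, size).shape)
```

Repeating the full training and validation procedure for several values of `size` gives one data point per resolution, which is one way to approach item 5 of the question.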

In [3]:
X = np.load('Xtrain.npy')
X = np.vstack(list(map(convert_sample, X)))
# Fit the scaler on the training data only and reuse it for the test data,
# so that both sets are scaled with the same statistics.
# with_mean=False keeps the original behaviour: scale to unit variance, no centering.
scaler = StandardScaler(with_mean=False, with_std=True)
X = scaler.fit_transform(X)
print(f'Shape of training data features (observations,features): {X.shape}')

y = np.load('ytrain.npy')
y = y.reshape(-1,)
print(f'Shape of training data labels (observations,): {y.shape}')

Xtest = np.load('Xtest.npy')
Xtest = np.vstack(list(map(convert_sample, Xtest)))
Xtest = scaler.transform(Xtest)
print(f'Shape of test data features (observations,features): {Xtest.shape}')



Shape of training data features (observations,features): (26214, 1024)
Shape of training data labels (observations,): (26214,)
Shape of test data features (observations,features): (1638, 1024)




The data are now ready for training and prediction with a shallow learning model such as an SVM classifier. Below is a very simple illustration of how to construct and train a support vector machine on the data we have prepared. The resulting prediction file can be submitted to Kaggle for evaluation.

In [4]:
# Dimensionality reduction with PCA
from sklearn.decomposition import PCA
n_components = 100  # Adjust this value as needed
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)
Xtest_pca = pca.transform(Xtest)
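
One way to choose `n_components` rather than fixing it at 100 is to look at the cumulative explained variance. The sketch below picks the smallest number of components that reaches a 95% threshold; the synthetic data, the threshold and the names (`X_demo`, `pca_full`, `n_components_95`) are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 50 features driven by 5 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 95%.
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"Components needed for 95% of the variance: {n_components_95}")
```

Explained variance is only a proxy. The number of components that gives the best validation accuracy may differ, so it is worth treating `n_components` as a hyperparameter as well.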

In [5]:
# Hyperparameter searching
from sklearn.model_selection import train_test_split, GridSearchCV

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Define the SVM model
# classifier = svm.SVC()

C_list = [0.1, 1, 10, 100, 1000] 
# degree_list = [2, 3, 4] 
gamma_list = ['scale', 'auto', 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] 

results = []

# kernel = 'poly'
# gamma_na = 'N/A'
# for C in tqdm.tqdm(C_list):
#     for degree in degree_list:
#         svm_current_poly = svm.SVC(
#             kernel=kernel, 
#             C=C, 
#             degree=degree
#         )
#         svm_current_poly.fit(X_train, y_train)
#         y_val_hat_poly = svm_current_poly.predict(X_val)
#         accuracy_poly = accuracy_score(y_val, y_val_hat_poly)
        
#         results.append([accuracy_poly, kernel, C, degree, gamma_na])
        
kernel = 'rbf'
degree_na = 'N/A'
for C in tqdm.tqdm(C_list):
    for gamma in gamma_list:
        svm_current_rbf = svm.SVC(
            kernel=kernel, 
            C=C, 
            gamma=gamma
        )
        svm_current_rbf.fit(X_train, y_train)
        y_val_hat_rbf = svm_current_rbf.predict(X_val)
        accuracy_rbf = accuracy_score(y_val, y_val_hat_rbf)
        results.append([accuracy_rbf, kernel, C, degree_na, gamma])

results = pd.DataFrame(results)
results.columns = ['Accuracy', 'Kernel', 'C', 'Degree', 'Gamma']
print(results)

100%|████████████████████████████████████████████████████████████████████████████████| 5/5 [5:04:14<00:00, 3650.84s/it]


    Accuracy Kernel       C Degree  Gamma
0   0.677093    rbf     0.1    N/A  scale
1   0.668701    rbf     0.1    N/A   auto
2   0.526035    rbf     0.1    N/A    0.1
3   0.496471    rbf     0.1    N/A    0.2
4   0.496471    rbf     0.1    N/A    0.3
5   0.496471    rbf     0.1    N/A    0.4
6   0.496471    rbf     0.1    N/A    0.5
7   0.496471    rbf     0.1    N/A    0.6
8   0.496471    rbf     0.1    N/A    0.7
9   0.496471    rbf     0.1    N/A    0.8
10  0.496471    rbf     0.1    N/A    0.9
11  0.687965    rbf     1.0    N/A  scale
12  0.709708    rbf     1.0    N/A   auto
13  0.574290    rbf     1.0    N/A    0.1
14  0.571619    rbf     1.0    N/A    0.2
15  0.571619    rbf     1.0    N/A    0.3
16  0.571619    rbf     1.0    N/A    0.4
17  0.571619    rbf     1.0    N/A    0.5
18  0.571619    rbf     1.0    N/A    0.6
19  0.571619    rbf     1.0    N/A    0.7
20  0.571619    rbf     1.0    N/A    0.8
21  0.571619    rbf     1.0    N/A    0.9
22  0.703223    rbf    10.0    N/A
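
In the results above, the accuracy is identical for every numeric gamma from 0.2 to 0.9 at a given C, which suggests that these values may be too large for the scaled features and that a log-spaced grid of smaller values could be more informative. The sketch below shows how the manual loop might be replaced with `GridSearchCV` on a pipeline, so that scaling and PCA are refit inside each cross-validation fold. The synthetic data, the grid values and the step names (`scale`, `pca`, `svm`) are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the flattened image features (NOT the PCam data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=10, random_state=0
)

# Preprocessing lives inside the pipeline, so it is fit on the training folds only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("svm", SVC(kernel="rbf")),
])

# Log-spaced candidate values; parameters are addressed as <step name>__<parameter>.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.001, 0.01],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Given that the loop above took about five hours, it may also be worth running the search on a random subset of the training data first and refining the grid around the best region afterwards.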

In [6]:
# Extract best result
results[results['Accuracy'] == results['Accuracy'].max()]

    Accuracy Kernel    C Degree Gamma
12  0.709708    rbf  1.0    N/A  auto


In [8]:
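# Initialize the final model
svm_optimal = svm.SVC(kernel='rbf', C=1, gamma='auto')  # update the params based on the best result above
# Train and predict on the PCA features, since the hyperparameters were tuned on X_pca
svm_optimal.fit(X_pca, y)
y_test_hat = svm_optimal.predict(Xtest_pca)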

In [9]:
ytest_hat = pd.DataFrame({
    'Id': list(range(len(y_test_hat))),
    'Predicted': y_test_hat.reshape(-1,),
})
ytest_hat.to_csv('ytest_hat.csv', index=False)  # write the submission file for Kaggle