# Assignment - shallow learning

Hi there! In this assignment, you will use shallow learning (including SVMs, random forests and, if you feel up for the challenge, gradient boosting) to solve an adapted version of Question 1 from the winter 2023 exam in applied machine learning:

## Introduction:

During the semester you have become very excited about the field of digital pathology, an area that is developing rapidly thanks to advancements in microscopy imaging hardware. These advancements have made it possible to digitize glass slides into whole-slide images. You have recently read the paper by [Veeling et al. (2018)](https://arxiv.org/abs/1806.03962) and you are thrilled to see that the authors have derived a novel dataset, denoted PatchCamelyon (PCam), that will enable you to develop and benchmark your own machine learning models. Like Veeling et al. (2018), you are primarily interested in developing machine learning models that, based on patches of whole-slide images of lymph node sections, can assist pathologists in tumor detection.

The primary objective of this exam is to perform image classification using the PCam dataset. The full dataset consists of 327,680 color images (96x96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. For this assignment, however, you are only going to use the subset of the data that has been made available on Kaggle.

### Question 1 (adapted from the exam):
Use non-deep learning to perform image classification (tumor detection). Consider among other things the following:
1. Support vector machines
2. Random forests
3. Boosting
4. A combination of two or all three of the methods
5. Assess the importance of image resolution for the methods you are using
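One possible way to approach item 4 is scikit-learn's `VotingClassifier`, which combines several fitted models into one predictor. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the flattened image features, and the names `X_demo`, `y_demo` and `voter` as well as all hyperparameter values are hypothetical choices, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for flattened image features (assumption: NOT the PCam data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Soft voting averages the predicted class probabilities of the three models,
# so the SVM needs probability=True.
voter = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
voter.fit(X_tr, y_tr)

val_accuracy = voter.score(X_va, y_va)
print(f"Validation accuracy of the combined model: {val_accuracy:.3f}")
```

Soft voting is just one option; `voting="hard"` (majority vote) or a stacked model may work equally well or better on the real data.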

The assignment is posted as a Kaggle competition and is available here: https://www.kaggle.com/t/1f880200648443a3a30878d318cc6e4b


# Hints to get you started (with a very simple model)

In [1]:
from sklearn import svm
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn import tree
import tqdm
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

We first define a function that grayscales, resizes and flattens an image. This function may also come in handy later (for deep learning) if the original images are too large for your hardware configuration.

In [2]:
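def convert_sample(image):
    """Convert one RGB image to a flattened 32x32 grayscale row vector."""
    image = tf.image.rgb_to_grayscale(image)  # (H, W, 3) -> (H, W, 1)
    image = tf.image.resize(image, [32, 32]).numpy()  # downsample to 32x32
    image = image.reshape(1, -1)  # flatten to shape (1, 1024)
    return image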
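To assess the importance of image resolution (item 5 of the question), it may help to make the target size a parameter and build one feature matrix per resolution. The sketch below is a NumPy-only variant and rests on several assumptions: the helper name `convert_sample_np` is hypothetical, it uses a plain channel mean and block averaging rather than the grayscale conversion and interpolation that TensorFlow performs, and it runs on random arrays that only mimic 96x96 RGB images.

```python
import numpy as np


def convert_sample_np(image, size=32):
    """Grayscale, block-average to size x size, and flatten one RGB image."""
    height, width, _ = image.shape
    if height % size or width % size:
        raise ValueError("size must divide the image height and width")
    # Simple channel mean (assumption: not TensorFlow's weighted conversion).
    gray = image.astype(np.float64).mean(axis=2)
    fh, fw = height // size, width // size
    # Average each fh x fw block of pixels down to a single value.
    pooled = gray.reshape(size, fh, size, fw).mean(axis=(1, 3))
    return pooled.reshape(1, -1)


# Random stand-in images (assumption: NOT the PCam data).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 96, 96, 3), dtype=np.uint8)

# One feature matrix per candidate resolution; each size must divide 96.
for size in (8, 16, 32, 48, 96):
    features = np.vstack([convert_sample_np(img, size) for img in images])
    print(size, features.shape)
```

Training the same model on each of these feature matrices and comparing validation accuracy and run time gives a first impression of how much resolution matters.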

In [3]:
X = np.load('Xtrain.npy')
X = np.vstack(list(map(convert_sample, X)))
# Standardize each feature to zero mean and unit variance; fit on the training data only
scaler = StandardScaler().fit(X)
X = scaler.transform(X)
print(f'Shape of training data features (observations,features): {X.shape}')

y = np.load('ytrain.npy')
y = y.reshape(-1,)
print(f'Shape of training data labels (observations,): {y.shape}')

Xtest = np.load('Xtest.npy')
Xtest = np.vstack(list(map(convert_sample, Xtest)))
# Reuse the training statistics so that train and test features share the same scale
Xtest = scaler.transform(Xtest)
print(f'Shape of test data features (observations,features): {Xtest.shape}')



Shape of training data features (observations,features): (26214, 1024)
Shape of training data labels (observations,): (26214,)
Shape of test data features (observations,features): (1638, 1024)




The data are now ready for training and prediction with a shallow learning model such as an SVM classifier. The cells below give a very simple illustration of how to tune and train a model on the prepared data (here a decision tree). The resulting prediction file can be submitted to Kaggle for evaluation.
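Since the cells that follow use a decision tree, an SVM version might look roughly like the sketch below. It is a minimal illustration under assumptions: it runs on synthetic data from `make_classification` instead of the prepared `X` and `y`, and the variable names and the hyperparameter values (`kernel="rbf"`, `C=1.0`, `gamma="scale"`) are illustrative defaults rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=30, n_informative=10, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling inside the pipeline is fitted on the training split only.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm_clf.fit(X_tr, y_tr)

svm_val_accuracy = accuracy_score(y_va, svm_clf.predict(X_va))
print(f"SVM validation accuracy: {svm_val_accuracy:.3f}")
```

On the real data, kernel SVM training time grows quickly with the number of observations, so it may be worth starting on a subsample.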

In [4]:
# Hyperparameter searching
from sklearn.model_selection import train_test_split

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

min_samples_split_list = [5, 10] 
min_samples_leaf_list = [5, 10] 
max_features_list = [5, 10] 

results = []

for min_samples_split in tqdm.tqdm(min_samples_split_list):
    for min_samples_leaf in min_samples_leaf_list:
        for max_features in max_features_list:
            dt_current = tree.DecisionTreeClassifier(
                min_samples_split=min_samples_split,
                min_samples_leaf=min_samples_leaf,
                max_features=max_features,
                )
            dt_current = dt_current.fit(X_train, y_train)
            y_val_hat = dt_current.predict(X_val)
            accuracy = accuracy_score(y_val, y_val_hat)  # signature is (y_true, y_pred)

            results.append([accuracy, min_samples_split, min_samples_leaf, max_features])

results = pd.DataFrame(results)
results.columns = ['Accuracy', 'min_samples_split', 'min_samples_leaf', 'max_features']
print(results)

100%|████████████████████████████████████████████████████████████████████████████████| 3/3 [7:36:00<00:00, 9120.02s/it]


    Accuracy  n_estimators  min_samples_split  min_samples_leaf
0   0.750715            50                 15                 5
1   0.745566            50                 15                10
2   0.747473            50                 15                15
3   0.750715            50                 20                 5
4   0.745566            50                 20                10
5   0.747473            50                 20                15
6   0.750715            50                 25                 5
7   0.745566            50                 25                10
8   0.747473            50                 25                15
9   0.771696           200                 15                 5
10  0.772840           200                 15                10
11  0.768644           200                 15                15
12  0.771886           200                 20                 5
13  0.772840           200                 20                10
14  0.768644           200              
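The nested loops in cell In [4] can also be written with scikit-learn's `GridSearchCV`, which evaluates every parameter combination with cross-validation instead of a single validation split. The sketch below is an assumption-laden illustration: it uses synthetic data rather than the prepared `X` and `y`, reuses the grid values from the cell above, and the choice of `cv=3` and the names `param_grid` and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Same grid as the manual loop: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "min_samples_split": [5, 10],
    "min_samples_leaf": [5, 10],
    "max_features": [5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best mean cross-validated accuracy: {search.best_score_:.3f}")
```

With the default `refit=True`, `search.best_estimator_` is refitted on all the data passed to `fit` and can be used directly for prediction.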

In [5]:
# Extract best parameters (row with the highest validation accuracy)
best_row = results.loc[results['Accuracy'].idxmax()]
min_samples_split = int(best_row['min_samples_split'])
min_samples_leaf = int(best_row['min_samples_leaf'])
max_features = int(best_row['max_features'])

In [6]:
print(min_samples_split)
print(min_samples_leaf)
print(max_features)

500
25
5


In [7]:
# Initialize your final model
dt_optimized = tree.DecisionTreeClassifier(
    min_samples_split=min_samples_split,
    min_samples_leaf=min_samples_leaf,
    max_features=max_features,
    )

dt_optimized.fit(X, y)

y_test_hat = dt_optimized.predict(Xtest)

In [8]:
ytest_hat = pd.DataFrame({
    'Id': list(range(len(y_test_hat))),
    'Predicted': y_test_hat.reshape(-1,),
})
ytest_hat.to_csv('ytest_hat.csv', index=False)