# Assignment - shallow learning

Hi there! In this assignment, you will use shallow learning (including SVMs, random forests, and gradient boosting if you feel up for the challenge) to solve an adapted version of Question 1 from the winter 2023 exam in applied machine learning:

## Introduction:

During the semester you have become very excited about digital pathology, a field that is developing rapidly thanks to advances in microscopy imaging hardware. These advances have made it possible to digitize glass slides into whole-slide images. You have recently read the paper by [Veeling et al. (2018)](https://arxiv.org/abs/1806.03962), and you are thrilled to see that the authors have derived a novel dataset, denoted PatchCamelyon (PCam), that will enable you to develop and benchmark your own machine learning models. Like Veeling et al. (2018), you are primarily interested in developing machine learning models that, based on patches of whole-slide images of lymph node sections, can assist pathologists in tumor detection.

The primary objective of this exam is to perform image classification using the PCam dataset. The full dataset consists of 327,680 color images (96x96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. For this assignment, however, you are only going to use the subset of the data that has been made available on Kaggle.

### Question 1 (adapted from the exam):
Use non-deep-learning methods to perform image classification (tumor detection). Consider, among other things, the following:
1. Support vector machines
2. Random forests
3. Boosting
4. A combination of two or all three of the methods
5. Assess the importance of image resolution for the methods you are using
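One possible way to approach item 4 is soft voting, where the predicted class probabilities of an SVM, a random forest, and a gradient boosting model are averaged. The following is only a minimal sketch under stated assumptions: it uses scikit-learn's `VotingClassifier`, synthetic data from `make_classification` as a stand-in for the flattened image features, and placeholder hyperparameters that have not been tuned for PCam.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the flattened image features (assumption: not PCam data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Soft voting averages the class probabilities predicted by the base models.
# The SVM is wrapped in a pipeline so that it sees standardized features.
voter = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
voter.fit(X_tr, y_tr)

voting_accuracy = accuracy_score(y_va, voter.predict(X_va))
print(f"Validation accuracy of the voting ensemble: {voting_accuracy:.3f}")
```

Whether the combination actually beats the best single model on the image data is an empirical question that the validation set has to answer.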

The assignment is posted as a Kaggle competition and is available here: https://www.kaggle.com/t/1f880200648443a3a30878d318cc6e4b


# Hints to get you started (with a very simple model)

In [12]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
from sklearn import ensemble
import tqdm
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

Define a function that converts an image to grayscale, resizes it, and flattens it. This function may also come in handy later (for deep learning) if the original images are too large for your hardware configuration.

In [6]:
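def convert_sample(image):
    # Convert the RGB image to a single grayscale channel
    image = tf.image.rgb_to_grayscale(image)
    # Downsample to 32x32 pixels and convert the tensor to a NumPy array
    image = tf.image.resize(image,[32,32]).numpy()
    # Flatten to a single row of 32*32 = 1024 features
    image = image.reshape(1,-1)
    return image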
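To assess the importance of image resolution, the target size can be made a parameter of the conversion step. The sketch below is an illustration only: `convert_sample_np` is a hypothetical NumPy-only variant that downsamples by block averaging rather than with `tf.image.resize`, so its output will not be identical to that of `convert_sample`. It assumes RGB input of shape (H, W, 3) and a target size that divides H and W evenly, and it uses random arrays in place of real images.

```python
import numpy as np

def convert_sample_np(image, size=32):
    """Grayscale, downsample by block averaging, and flatten one RGB image.

    Assumption: `image` has shape (H, W, 3) and `size` divides both H and W.
    """
    # Standard luminance weights for RGB to grayscale conversion
    weights = np.array([0.2989, 0.5870, 0.1140], dtype=np.float32)
    gray = image[..., :3].astype(np.float32) @ weights
    h, w = gray.shape
    if h % size or w % size:
        raise ValueError("size must divide the image height and width")
    # Group pixels into a size x size grid of blocks and average each block
    blocks = gray.reshape(size, h // size, size, w // size)
    small = blocks.mean(axis=(1, 3))
    # Flatten to a single row of size*size features
    return small.reshape(1, -1)

# Random stand-in for a batch of 96x96 RGB patches (assumption: not real PCam data)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 96, 96, 3), dtype=np.uint8)

for size in (8, 16, 32, 48):
    features = np.vstack([convert_sample_np(img, size) for img in fake_images])
    print(f"{size}x{size} -> feature matrix {features.shape}")
```

Training the same model on each of these feature matrices and comparing validation accuracy gives a simple resolution study.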

In [7]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [8]:
X = np.load('/content/gdrive/MyDrive/AML/L03/Xtrain.npy')
X = np.vstack(list(map(convert_sample,X)))
# with_mean and with_std are boolean switches (both True by default), not target values.
# Fit the scaler on the training data only and reuse it for the test data.
scaler = StandardScaler()
X = scaler.fit_transform(X)
print(f'Shape of training data features (observations,features): {X.shape}')

y = np.load('/content/gdrive/MyDrive/AML/L03/ytrain.npy')
y = y.reshape(-1,)
print(f'Shape of training data labels (observations,): {y.shape}')

Xtest = np.load('/content/gdrive/MyDrive/AML/L03/Xtest.npy')
Xtest = np.vstack(list(map(convert_sample,Xtest)))
Xtest = scaler.transform(Xtest)
print(f'Shape of test data features (observations,features): {Xtest.shape}')



Shape of training data features (observations,features): (26214, 1024)
Shape of training data labels (observations,): (26214,)
Shape of test data features (observations,features): (1638, 1024)
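Feature scaling generally matters more for SVMs than for tree-based models. One way to keep the scaling step tied to the classifier is to wrap both in a scikit-learn pipeline, so that the scaling statistics are learned from the training part of each cross-validation fold only. The sketch below is an illustration on synthetic data from `make_classification`, not on the PCam features, and the values of `C` and `gamma` are just the library defaults.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the flattened image features (assumption: not PCam data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)

# The scaler is refit on the training part of every fold, then applied
# to the held-out part before the SVM makes its predictions.
svm_pipeline = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)
cv_scores = cross_val_score(svm_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```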




The data are now ready for training and prediction with a shallow learning model such as an SVM classifier or a random forest. Below is a very simple illustration of how to tune and train a random forest on the data we have prepared. The resulting prediction file can be submitted to Kaggle for evaluation.

In [14]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

n_estimators_list = [20, 200, 500]
min_samples_split_list = [15, 20]
min_samples_leaf_list = [5, 10, 15]

results = []

for n_estimators in tqdm.tqdm(n_estimators_list):
    for min_samples_split in min_samples_split_list:
        for min_samples_leaf in min_samples_leaf_list:
            rf_current = ensemble.RandomForestClassifier(
                n_estimators=n_estimators,
                min_samples_split=min_samples_split,
                min_samples_leaf=min_samples_leaf,
                )
            rf_current.fit(X_train, y_train)
            y_val_hat = rf_current.predict(X_val)
            accuracy = accuracy_score(y_val, y_val_hat)

            results.append([accuracy, n_estimators, min_samples_split, min_samples_leaf])

results = pd.DataFrame(results)
results.columns = ['Accuracy', 'n_estimators', 'min_samples_split', 'min_samples_leaf']
print(results)

100%|██████████| 3/3 [47:13<00:00, 944.58s/it] 

    Accuracy  n_estimators  min_samples_split  min_samples_leaf
0   0.752241            20                 15                 5
1   0.753767            20                 15                10
2   0.753767            20                 15                15
3   0.755293            20                 20                 5
4   0.749762            20                 20                10
5   0.754148            20                 20                15
6   0.777990           200                 15                 5
7   0.774366           200                 15                10
8   0.773794           200                 15                15
9   0.775510           200                 20                 5
10  0.772268           200                 20                10
11  0.766927           200                 20                15
12  0.779134           500                 15                 5
13  0.773603           500                 15                10
14  0.770551           500              




In [15]:
# Extract best parameters
results[results['Accuracy'] == results['Accuracy'].max()]

    Accuracy  n_estimators  min_samples_split  min_samples_leaf
15  0.783139           500                 20                 5
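The nested loops above evaluate each parameter combination on a single validation split. A possible alternative is scikit-learn's `GridSearchCV`, which runs the same kind of search with k-fold cross-validation and records the best combination. The sketch below is hedged accordingly: it uses synthetic data from `make_classification` and a deliberately small grid so that it runs quickly, so the grid values are not a recommendation for PCam.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the flattened image features (assumption: not PCam data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)

# Small illustrative grid: 2 * 2 * 2 = 8 combinations
param_grid = {
    "n_estimators": [20, 50],
    "min_samples_split": [15, 20],
    "min_samples_leaf": [5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation for every combination
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

With the default `refit=True`, `search.best_estimator_` is already retrained on all the data passed to `fit`, so it can be used for prediction directly.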


In [16]:
# Initialize the final model
clf = RandomForestClassifier(n_estimators=500, min_samples_split=20, min_samples_leaf=5, random_state=42)

# Train the final model
clf.fit(X, y)

# Predict the test labels with the final model
y_test_hat = clf.predict(Xtest)

In [17]:
ytest_hat = pd.DataFrame({
    'Id': list(range(len(y_test_hat))),
    'Predicted': y_test_hat.reshape(-1,),
})
ytest_hat.to_csv('/content/gdrive/MyDrive/AML/L03/ytest_hat.csv', index=False)