Image classification with CNNs
================

The goal of this exercise is to implement a specific CNN architecture with PyTorch and train it on the CIFAR-10 image classification dataset. We will start by introducing the dataset and then implement an `nn.Module` and a useful `Solver` class. Separating the model from the actual training has proven to be a sensible design decision. By the end of this exercise you should have successfully trained your (possibly) first CNN model and have a boilerplate `Solver` class which you can reuse for the next exercise and your future research projects.

For inspiration on how to implement a model or the solver class, have a look at [these](https://github.com/pytorch/examples) PyTorch examples.

In [12]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import pickle
from tqdm import tqdm  # progress bars used in the evaluation and prediction cells
from torch.autograd import Variable
from data_utils import read_cancer_dataset
from data_utils import data_augmentation
from data_utils import norm_split_data
from data_utils import devide_dataset_in_class_folders_and_duplicate_small_classes
from data_utils import append_augmented_data


csv_full_name = '/Users/yuminsun/dl4cvproject/data/train.csv'
img_folder_full_name = '/Users/yuminsun/dl4cvproject/data/train256'

# %matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Cancer Dataset
=========

Since the focus of this exercise is on neural network models and how to train them successfully, we provide you with preprocessed and prepared datasets. To make managing the train, validation and test data pipelines even easier, we provide custom `torch.utils.data.Dataset` classes. Use the official [documentation](http://pytorch.org/docs/data.html) to familiarize yourself with the `Dataset` and `DataLoader` classes. Think about how to integrate them into your training loop and have a look at the data preprocessing steps in `dl4cv/data_utils.py`.

The `num_workers` argument of the `DataLoader` class allows you to load and preprocess data in parallel with multiple worker processes.

<div class="alert alert-info">
    <h3>Note</h3>
    <p>In this case we generated the `Dataset` classes after we applied all the preprocessing steps. Other datasets or random data augmentation might require online preprocessing, which can be integrated into the `Dataset` classes. See `torchvision.transforms` for examples.</p>
</div>
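
As a minimal sketch of how a custom `Dataset` plugs into a `DataLoader`, assuming the preprocessed images and labels are already available as tensors. The class name `ArrayDataset` and the random placeholder tensors are hypothetical and are not the classes provided in `dl4cv/data_utils.py`:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ArrayDataset(Dataset):
    """Hypothetical dataset wrapping preprocessed image and label tensors."""

    def __init__(self, images, labels):
        assert len(images) == len(labels)
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        return self.images[index], self.labels[index]


# Placeholder data: 10 RGB images of size 32x32 with labels from 14 classes.
images = torch.randn(10, 3, 32, 32)
labels = torch.randint(0, 14, (10,))
dataset = ArrayDataset(images, labels)

# num_workers=0 loads data in the main process; larger values start worker processes.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

# Each iteration yields one (inputs, targets) batch for the training loop.
batch_shapes = [(x.shape, y.shape) for x, y in loader]
```

With 10 samples and a batch size of 4, the loader yields three batches, the last one holding the remaining two samples.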

In [None]:
data,class_statistics,class_fractions,img_name_list,bad_img_name_list = read_cancer_dataset(csv_full_name=csv_full_name,img_folder_full_name=img_folder_full_name)

# print('store in local...')

# with open('raw_data.pickle', 'wb') as f_raw:
#     # Pickle the 'data' dictionary using the highest protocol available.
#     pickle.dump(raw_data, f_raw, pickle.HIGHEST_PROTOCOL)
    
# print('OK...')

bad image  scan_00010127.png
bad image  scan_00010909.png
bad image  scan_00012200.png
bad image  scan_00013156.png
bad image  scan_00014013.png
bad image  scan_00014456.png
bad image  scan_00014562.png
bad image  scan_00014710.png
bad image  scan_00015073.png
bad image  scan_00015624.png
bad image  scan_00016637.png
bad image  scan_00017386.png
bad image  scan_00018060.png
bad image  scan_00018170.png
bad image  scan_00019058.png
bad image  scan_00019183.png
bad image  scan_00019278.png
bad image  scan_00019285.png
bad image  scan_00019605.png
bad image  scan_00019651.png
bad image  scan_00020202.png

 39%|███▉      | 7206/18577 [00:57<01:30, 125.14it/s]

In [45]:
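# Assumption: the fractions and statistics are the values returned by read_cancer_dataset above.
devide_dataset_in_class_folders_and_duplicate_small_classes(data, class_fractions, class_statistics)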

In [None]:
X_train,y_train,X_val,y_val,X_test,y_test = norm_split_data(aug_data)

print('Store good, augment, norm, split data in local...')

with open('data.pickle', 'wb') as f_final:
    # Pickle the 'data' dictionary using the highest protocol available.
    pickle.dump(data, f_final, pickle.HIGHEST_PROTOCOL)
    
print('OK...')


In [None]:
# with open('aug_data.pickle', 'rb') as f:
#     # The protocol version used is detected automatically, so we do not
#     # have to specify it.
#     loaddata = pickle.load(f)

Visualize Examples
------------------

To familiarize yourself with the dataset, we visualize a few examples from each class. Note that we have to revert some preprocessing steps (transposition and mean subtraction) before plotting.
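
Reverting the preprocessing could look roughly like the sketch below. It assumes that each sample is stored channels-first as a float array with a mean image subtracted and pixel values in the range 0 to 255; the helper name `to_displayable` and the placeholder arrays are hypothetical and not part of `data_utils`:

```python
import numpy as np


def to_displayable(image_chw, mean_image_chw):
    """Undo mean subtraction and move channels last for plotting."""
    restored = image_chw + mean_image_chw      # revert mean subtraction
    restored = restored.transpose(1, 2, 0)     # (C, H, W) -> (H, W, C)
    return np.clip(restored, 0, 255).astype(np.uint8)


# Placeholder data standing in for one preprocessed sample.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(3, 8, 8)).astype(np.float64)
mean_image = np.full((3, 8, 8), 120.0)
preprocessed = original - mean_image

restored = to_displayable(preprocessed, mean_image)
```

The resulting array has shape `(H, W, C)` and can be passed directly to `plt.imshow(restored)`.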

In [None]:
# Assumption: the fractions are the class fractions returned by read_cancer_dataset above.
aug_data = data_augmentation(data, class_fractions)

# print('Store augment data in local...')

# with open('aug_data.pickle', 'wb') as f_aug:
#     # Pickle the 'data' dictionary using the highest protocol available.
#     pickle.dump(aug_data, f_aug, pickle.HIGHEST_PROTOCOL)
    
# print('OK...')

## Model Architecture and Forward Pass 

Once you have understood the core concepts of PyTorch and have a rough idea of how to implement your own model, complete the initialization and forward methods of the `ClassificationCNN` in the `dl4cv/classifiers/classification_cnn.py` file. Note that we do not have to implement a backward pass since this is handled automatically by the `autograd` package.

Use the cell below to check your results:
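
A check of this kind could look roughly like the following sketch. `TinyCNN` is a hypothetical stand-in, not the `ClassificationCNN` from the exercise; its layer sizes, the 3x32x32 input and the 14 output classes are assumptions chosen only to illustrate the forward pass and the automatic backward pass:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCNN(nn.Module):
    """Hypothetical conv -> relu -> pool -> fc -> relu -> fc model."""

    def __init__(self, in_channels=3, num_filters=8, input_size=32,
                 hidden_dim=32, num_classes=14):
        super().__init__()
        # padding=1 with a 3x3 kernel preserves the spatial size.
        self.conv = nn.Conv2d(in_channels, num_filters, kernel_size=3, padding=1)
        pooled_size = input_size // 2  # 2x2 max pooling halves height and width
        self.fc1 = nn.Linear(num_filters * pooled_size * pooled_size, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv(x)), kernel_size=2)
        x = x.flatten(start_dim=1)  # keep the batch dimension
        x = F.relu(self.fc1(x))
        return self.fc2(x)          # raw class scores (logits)


model = TinyCNN()
scores = model(torch.randn(5, 3, 32, 32))

# No hand-written backward pass: autograd fills in the gradients.
scores.sum().backward()
```

Checking the output shape (one row of class scores per input image) is a cheap way to catch mismatched layer dimensions before training.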

## Training and Validation with the Solver
We train and validate our previously generated model with a separate `Solver` class defined in `dl4cv/solver.py`. Complete the `.train()` method and try to come up with an efficient iteration scheme as well as an informative training logger.

Use the cells below to test your solver. A nice trick is to train your model with just a few training samples. You should be able to overfit small datasets, which will result in very high training accuracy and comparatively low validation accuracy.

<div class="alert alert-info">
    <h3>Note</h3>
    <p>As seen below, the design of our `Solver` class is independent of the particular model or data pipeline. This makes the class easy to reuse, and its modular structure allows different models to be trained with it.</p>
</div>
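
A minimal sketch of such a model-independent training loop, overfitting a handful of random samples. `MiniSolver`, its arguments and the toy fully connected model are hypothetical and much simpler than the `Solver` in `dl4cv/solver.py`:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class MiniSolver:
    """Hypothetical solver: knows nothing about the model or the data source."""

    def __init__(self, optim=torch.optim.Adam, lr=1e-2, loss_func=None):
        self.optim = optim
        self.lr = lr
        self.loss_func = nn.CrossEntropyLoss() if loss_func is None else loss_func
        self.train_loss_history = []  # one entry per iteration
        self.train_acc_history = []   # one entry per epoch

    def train(self, model, train_loader, num_epochs=10):
        optimizer = self.optim(model.parameters(), lr=self.lr)
        model.train()
        for _ in range(num_epochs):
            correct, total = 0, 0
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = self.loss_func(outputs, targets)
                loss.backward()
                optimizer.step()
                self.train_loss_history.append(loss.item())
                correct += (outputs.argmax(dim=1) == targets).sum().item()
                total += targets.size(0)
            self.train_acc_history.append(correct / total)


# Tiny random dataset: 16 samples, 20 features, 3 classes.
torch.manual_seed(0)
inputs = torch.randn(16, 20)
targets = torch.randint(0, 3, (16,))
loader = DataLoader(TensorDataset(inputs, targets), batch_size=8, shuffle=True)

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 3))
solver = MiniSolver()
solver.train(model, loader, num_epochs=50)
```

Because the solver only receives a model and a data loader, the same loop works for any `nn.Module` and any `DataLoader`. The recorded histories are what a plot of loss and accuracy would be drawn from.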

Plotting the loss, training accuracy, and validation accuracy should show clear overfitting:

## Train the Network
Now train your model with the full dataset. By training a `ThreeLayerCNN` model for one epoch, you should already achieve greater than 40% accuracy on the validation set. If training is painfully slow, check that you did not forget to call the `nn.Module.cuda()` method.

For the overfitting example we provided you with a set of hyperparameters (`hidden_dim`, `lr`, `weight_decay`, ...). You can start with the same parameter values, but in order to maximize your accuracy you should train multiple models with different sets of hyperparameters. This process is called hyperparameter optimization.
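
One simple way to organize this is a random search, sketched below. The search ranges, the function names and `dummy_train_and_evaluate` are assumptions for illustration; the placeholder objective only mimics a validation accuracy and would have to be replaced by real training followed by validation:

```python
import math
import random


def sample_hyperparameters(rng):
    """Draw one configuration; learning rate and weight decay are log-uniform."""
    return {
        "lr": 10 ** rng.uniform(-5, -2),
        "weight_decay": 10 ** rng.uniform(-6, -2),
        "hidden_dim": rng.choice([64, 128, 256, 512]),
    }


def random_search(train_and_evaluate, num_trials=20, seed=0):
    """Return the best (validation accuracy, configuration) over random trials."""
    rng = random.Random(seed)
    best_acc, best_config = -math.inf, None
    for _ in range(num_trials):
        config = sample_hyperparameters(rng)
        val_acc = train_and_evaluate(**config)
        if val_acc > best_acc:
            best_acc, best_config = val_acc, config
    return best_acc, best_config


def dummy_train_and_evaluate(lr, weight_decay, hidden_dim):
    """Placeholder objective; replace with real training plus validation."""
    return 1.0 / (1.0 + abs(math.log10(lr) + 3.0))


best_acc, best_config = random_search(dummy_train_and_evaluate)
```

Sampling the learning rate and weight decay on a logarithmic scale spreads the trials evenly across orders of magnitude, which is usually where these parameters matter.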

In [None]:
from yz.classifiers.classification_cnn import ClassificationCNN
from yz.classifiers.transferred_alexnet import alexnet
from yz.solver import Solver
from yz.data_utils import get_balanced_weights
from torchvision import models
import torch.nn as nn

weights = get_balanced_weights(train_label_list, 14)
#sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))
train_loader = torch.utils.data.DataLoader(train_data, batch_size=40, shuffle=False, num_workers=8)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=40, shuffle=False, num_workers=8)

model = alexnet(pretrained=True)
#model.classifier  = nn.Sequential(
#            nn.Linear(12544, 4096),
#            nn.ReLU(inplace=True),
#            nn.Linear(4096, 14),
#)
#model.classifier = nn.Sequential(
#            nn.Linear(12544, 4096),
#            nn.ReLU(inplace=True),
#            nn.Linear(4096, 4096),
#            nn.ReLU(inplace=True),
#            nn.Linear(4096, 14)
#)

#list(model.classifier.children())[:-1] = nn.Linear(4096, 14)  
#for param in model.features.parameters():
#    param.requires_grad = False
solver = Solver()
solver.train(model, train_loader, val_loader, log_nth=1, num_epochs=2)

# Test your Model
Run your best model on the test set. You should easily achieve an accuracy above 10% (the level of random guessing for a classification task with 10 classes) on the given test set:

In [None]:
test_loader = torch.utils.data.DataLoader(test_data, batch_size=50, shuffle=False, num_workers=4)

model.eval()  # evaluation mode: disables dropout and batch-norm updates
scores = []
for inputs, target in tqdm(test_loader):
    #print(type(target))
    inputs, targets = Variable(inputs), Variable(target)
    #if model.is_cuda:
    #    inputs, targets = inputs.cuda(), targets.cuda()

    outputs = model(inputs)
    _, preds = torch.max(outputs, 1)
    scores.extend((preds == targets).data.cpu().numpy())
    
print('Test set accuracy: %f' % np.mean(scores))

## Get final test data

In [None]:
from yz.data_utils import get_Cancer_datasets
csv_full_name = '/home/hpc/pr92no/ga42cih2/Projects/dl4cvproject/data/test.csv'
img_folder_full_name = '/home/hpc/pr92no/ga42cih2/Projects/dl4cvproject/data/test_400'
test_X, csv_test = get_Cancer_datasets(csv_full_name=csv_full_name,img_folder_full_name=img_folder_full_name, mode='upload')

In [None]:
v = csv_test['detected'].values

In [None]:
print(type(csv_test))
print(test_X.size())

In [None]:
# Drop the metadata columns that are not part of the submission file.
for column in ['age', 'gender', 'view_position', 'image_name', 'detected']:
    try:
        del csv_test[column]
    except KeyError as e:
        # The column was already removed, e.g. when the cell is run twice.
        print(e)

print(list(csv_test))

In [None]:
inputs = test_X[1000:1020]
#if model.is_cuda:
#        inputs = inputs.cuda()
outputs = model(inputs)
_, preds = torch.max(outputs, 1)

## Prediction and Submission CSV

In [None]:
import pandas as pd
index = 0
jump = 30
detected = []
pred_set = set()
for i in tqdm(range(int(test_X.size()[0] / jump) + 1)):
    start = index
    end = index + jump
    if end >= (test_X.size()[0]) :
        end = test_X.size()[0]
    inputs = test_X[start:end]
    # if model.is_cuda:
    #     inputs = inputs.cuda()
    outputs = model(inputs)
    _, preds = torch.max(outputs, 1)
    ###
    int_list_preds = preds.data.cpu().numpy().tolist()
    for pred_num in int_list_preds:
        pred_set.add(pred_num + 1)
    str_list_preds = [('class_' + str(pred_num + 1)) for pred_num in int_list_preds]
    detected.extend(str_list_preds)
    ####
    if end == test_X.size()[0]:
        break
    index += jump

In [None]:
print(pred_set)
csv_test['detected'] = pd.Series(detected)
csv_test.to_csv('submission.csv', index=False)
print(csv_test)

model.save("models/classification_cnn.model")