# Visualization of Model Performance on Unseen Test Data

In this analysis, we split the labeled data (`train/` on Kaggle) into a combined training and validation set (80%) and a test set (20%). We performed training and model selection on the 80% partition, then ran the selected model on the remaining 20%, processing one image per GPU.

The results below show performance on the unseen 20% test data. Note that the data splits are **identical** across all experiments, i.e. the split is a controlled variable.
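The splitting code itself is not part of this notebook. As a minimal sketch of how a reproducible 80/20 split by image ID could be made, assuming scikit-learn's `train_test_split` with a fixed seed (the helper `split_image_ids`, the seed value, and the toy IDs are hypothetical, not taken from our pipeline):

```python
from sklearn.model_selection import train_test_split


def split_image_ids(image_ids, test_frac=0.2, seed=0):
    """Split unique image IDs into (train+val, test) lists, reproducibly."""
    # Sort first so the result does not depend on the input order.
    unique_ids = sorted(set(image_ids))
    trainval_ids, test_ids = train_test_split(
        unique_ids, test_size=test_frac, random_state=seed
    )
    return trainval_ids, test_ids


# Toy IDs standing in for the real ImageId values (hypothetical).
ids = [f"img_{i:04d}" for i in range(100)]
trainval_a, test_a = split_image_ids(ids)
trainval_b, test_b = split_image_ids(ids)

print(len(trainval_a), len(test_a))  # 80 20
print(test_a == test_b)              # True: same seed, same split
```

Fixing the seed is what makes the split a controlled variable: every experiment that calls the helper with the same IDs sees the same test images.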

In [1]:
import numpy as np 
import pandas as pd
from pathlib import Path
import os
import warnings
import pickle
from sklearn import metrics

from matplotlib import pyplot as plt
import matplotlib.image as mpimg

warnings.filterwarnings("ignore")
%matplotlib inline

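def multicalss_eval(model_path):
    """Return the test accuracy (in percent) for one predictions CSV under RESULTS_DIR."""
    test_pred = pd.read_csv(RESULTS_DIR/model_path, index_col=0)
    # Keep a single prediction row per image.
    test_pred = test_pred[~test_pred.duplicated("ImageId")]
    # Attach the labels from train.csv to each predicted image.
    tmp = test_pred.merge(df_train, on=KEY, how="left")
    # Assumption: "ClassID" comes from the predictions file, "ClassId" from train.csv.
    return 100 * metrics.accuracy_score(tmp["ClassID"], tmp["ClassId"])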

In [2]:
## Constants:
KEY = "ImageId"
KAGGLE = False
PREFIX = "/kaggle/" if KAGGLE else "../"

DATA_DIR = Path(PREFIX + 'input/imaterialist-fashion-2020-fgvc7/')
IMG_DIR = DATA_DIR / 'train'
RESULTS_DIR = Path(PREFIX + 'results/')

## Load data:
df_train = pd.read_csv(DATA_DIR/'train.csv')
df_train = df_train[~df_train.duplicated(KEY)]
df_train.head()

|    | ImageId | EncodedPixels | Height | Width | ClassId | AttributesIds |
|:--:|:-------:|:-------------:|:------:|:-----:|:-------:|:-------------:|
| 0  | 00000663ed1ff0c4e0132b9b9ac53f6e | 6068157 7 6073371 20 6078584 34 6083797 48 608... | 5214 | 3676 | 6  | 115,136,143,154,230,295,316,317 |
| 9  | 0000fe7c9191fba733c8a69cfaf962b7 | 2201176 1 2203623 3 2206071 5 2208518 8 221096... | 2448 | 2448 | 33 | 190 |
| 11 | 0002ec21ddb8477e98b2cbb87ea2e269 | 2673735 2 2676734 8 2679734 13 2682733 19 2685... | 3000 | 1997 | 33 | 182 |
| 15 | 0002f5a0ebc162ecfb73e2c91e3b8f62 | 435 132 1002 132 1569 132 2136 132 2703 132 32... | 567  | 400  | 10 | 108,115,119,141,155,229,286,316,317 |
| 18 | 0004467156e47b0eb6de4aa6479cbd15 | 132663 8 133396 25 134130 41 134868 53 135611 ... | 750  | 500  | 10 | 115,141,155,295,305,317 |


## Experiments

### Procedures

Loaded in order:

1. No fine-tuning, ResNet-50 (control / baseline model)
2. No fine-tuning, ResNet-101
3. With fine-tuning, ResNet-50

Only one experimental variable is changed at a time. As mentioned, the scores were computed on a **common test set (n=3200)**.
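The claim that all experiments share one test set can be checked directly from the prediction files. A minimal sketch, assuming each predictions table has an `ImageId` column as in the evaluation helper above (the function `same_test_set` and the toy frames are hypothetical):

```python
import pandas as pd


def same_test_set(pred_frames, key="ImageId"):
    """Return True if every predictions table covers exactly the same image IDs."""
    id_sets = [set(df[key]) for df in pred_frames]
    if not id_sets:
        return True
    return all(ids == id_sets[0] for ids in id_sets[1:])


# Toy prediction tables (hypothetical values).
preds_a = pd.DataFrame({"ImageId": ["a", "b", "c"], "ClassID": [1, 2, 3]})
preds_b = pd.DataFrame({"ImageId": ["c", "a", "b"], "ClassID": [3, 3, 3]})
preds_c = pd.DataFrame({"ImageId": ["a", "b", "d"], "ClassID": [1, 1, 1]})

print(same_test_set([preds_a, preds_b]))  # True: same IDs, different row order
print(same_test_set([preds_a, preds_c]))  # False: "d" replaces "c"
```

In this notebook the same check could be run on the three CSVs listed in `NAMES` before comparing their scores.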

In [3]:
NAMES = ["saved_head_model_scratch_resnet50.csv",
         "saved_head_model_scratch_resnet101.csv",
         "saved_head_model_pretrained_resnet50.csv"]
results = [multicalss_eval("experiments/" + name) for name in NAMES]

In [4]:
pd.DataFrame(results, index=NAMES, columns=["Score"])

|                                          |  Score  |
|:-----------------------------------------|:-------:|
| saved_head_model_scratch_resnet50.csv    | 1.0     |
| saved_head_model_scratch_resnet101.csv   | 2.78125 |
| saved_head_model_pretrained_resnet50.csv | 2.28125 |


### Summary

Our computational experiments show the following:

| Model Name |  Backbone  | Fine-tuned? | Score (accuracy, %) |
|:----------:|:----------:|:-----------:|:-------------------:|
| Control    | ResNet-50  | No          | 1.00                |
| Deeper     | ResNet-101 | No          | 2.78                |
| Finetuned  | ResNet-50  | Yes         | 2.28                |

In general, we conclude:

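* Fine-tuning helps: with the same ResNet-50 backbone, the score rises from 1.00 to 2.28.
* A deeper backbone is better suited to a dataset as complex and diverse as ours: without fine-tuning, ResNet-101 scores 2.78 versus 1.00 for ResNet-50.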
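To put the scores in absolute terms, they can be converted back into counts of correctly classified test images. A small worked example, assuming the score is accuracy in percent over the n=3200 test images stated above (the dictionary names are hypothetical):

```python
N_TEST = 3200  # size of the common test set

# Scores (accuracy in percent) as reported in the results table.
scores = {"Control": 1.0, "Deeper": 2.78125, "Finetuned": 2.28125}

# Number of correctly classified test images per model.
correct = {name: round(N_TEST * pct / 100) for name, pct in scores.items()}

print(correct)  # {'Control': 32, 'Deeper': 89, 'Finetuned': 73}
```

On this test set, the gap between the control and the deeper backbone therefore corresponds to 57 images.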