# Evaluating the Model

This notebook follows the training of the model and reports the final training, validation and test error. To satisfy every requirement needed to run the notebook, follow `Evaluate-the-final-model` in the readme.

In [1]:
# Imports
import tensorboard as tb

To import the evaluations as a dataframe, we used the TensorBoard.dev API. The results of our evaluation can be seen here: [Tensorboard.dev](https://tensorboard.dev/experiment/dTD0vaI3SdyRZYI4WLHRbg/#scalars). The following cell collects that data and turns it into a pandas dataframe with the runs as the index and the losses and performance metrics as the columns.

The documentation for this process can be found [here](https://www.tensorflow.org/tensorboard/dataframe_api).

In [7]:
# Find the id in the URL of the tensorboard.dev webpage
experiment_id = "dTD0vaI3SdyRZYI4WLHRbg" 
# Download the data from the tensorboard project
experiment = tb.data.experimental.ExperimentFromDev(experiment_id)
# Get the scalars into a dataframe
df = experiment.get_scalars()
# Reshape it to have the runs as the rows
output = df.pivot(index="run", columns="tag", values="value")
# Drop the eval row of the dataframe, this is a residual folder kept in tensorboard memory
# which contains the same information as the last evaluation performed (on test set here)
output = output.drop('eval')
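
The reshaping step in the cell above can be illustrated on toy data. This is only a sketch: it assumes the scalars dataframe is in long format with `run`, `tag`, `step` and `value` columns, and the `long_df` and `wide_df` names are hypothetical. The four values are taken from the evaluation results of this notebook.

```python
import pandas as pd

# Toy long-format data; the column names mirror the layout assumed
# for the scalars dataframe (one row per run/tag/step).
long_df = pd.DataFrame(
    {
        "run": ["training", "training", "test", "test"],
        "tag": ["Loss/total_loss", "Loss/classification_loss"] * 2,
        "step": [0, 0, 0, 0],
        "value": [0.221419, 0.18102, 0.264104, 0.223413],
    }
)

# One row per run, one column per tag.
wide_df = long_df.pivot(index="run", columns="tag", values="value")
print(wide_df)
```

Note that `pivot` raises a `ValueError` when a (`run`, `tag`) pair occurs more than once, for example if a tag were logged at several steps, so this reshaping relies on each run holding a single evaluation per tag.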

In [9]:
display(output)

| run | DetectionBoxes_Precision/mAP | DetectionBoxes_Precision/mAP (large) | DetectionBoxes_Precision/mAP (medium) | DetectionBoxes_Precision/mAP (small) | DetectionBoxes_Precision/mAP@.50IOU | DetectionBoxes_Precision/mAP@.75IOU | DetectionBoxes_Recall/AR@1 | DetectionBoxes_Recall/AR@10 | DetectionBoxes_Recall/AR@100 | DetectionBoxes_Recall/AR@100 (large) | DetectionBoxes_Recall/AR@100 (medium) | DetectionBoxes_Recall/AR@100 (small) | Loss/classification_loss | Loss/localization_loss | Loss/regularization_loss | Loss/total_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| test | 0.57758 | 0.804416 | 0.420162 | 0.004719 | 0.82329 | 0.649656 | 0.369214 | 0.635027 | 0.658216 | 0.844016 | 0.557507 | 0.157143 | 0.223413 | 0.003314 | 0.037377 | 0.264104 |
| training | 0.600287 | 0.818538 | 0.414392 | 0.00516 | 0.840275 | 0.676805 | 0.393724 | 0.65273 | 0.672312 | 0.85308 | 0.558387 | 0.147729 | 0.18102 | 0.003023 | 0.037377 | 0.221419 |
| validation | 0.582718 | 0.802349 | 0.398904 | 0.007373 | 0.826129 | 0.655632 | 0.381782 | 0.640276 | 0.662531 | 0.840806 | 0.55184 | 0.156835 | 0.197278 | 0.003231 | 0.037377 | 0.237886 |


An interesting thing to note here is that the cost reported during training was calculated on a subset of the training data set, while the training error above was calculated on the entire training set. Despite this, the two values are very similar.

In general, the mAP, the AR and the total loss are all relatively close across the three sets. The total loss shows the largest change between training and test (0.221419 vs. 0.264104), most likely because it combines the classification, localization and regularization losses. It is also not surprising that the regularization loss is constant, since it depends only on the weights of the network, which are the same whichever set is evaluated.
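
As a quick sanity check of how the total loss combines the three components, the sketch below assumes the total is their plain sum and compares that sum with the reported total. The loss values are copied by hand from the table above; the `losses`, `components` and `residual` names are hypothetical.

```python
import pandas as pd

# Loss columns copied from the evaluation table above.
losses = pd.DataFrame(
    {
        "Loss/classification_loss": [0.223413, 0.18102, 0.197278],
        "Loss/localization_loss": [0.003314, 0.003023, 0.003231],
        "Loss/regularization_loss": [0.037377, 0.037377, 0.037377],
        "Loss/total_loss": [0.264104, 0.221419, 0.237886],
    },
    index=["test", "training", "validation"],
)

# Assumption: the total loss is the sum of the three component losses.
components = losses.drop(columns="Loss/total_loss")
residual = (components.sum(axis=1) - losses["Loss/total_loss"]).abs()
print(residual)

# The regularization loss takes a single value across the three sets.
print(losses["Loss/regularization_loss"].nunique() == 1)
```

With the rounded values shown in the table, the residual stays below 1e-5 for every set, which is consistent with the total being the sum of the three components.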

If we focus on the total loss, we get these 3 values:
- test loss: 0.264104
- train loss: 0.221419
- validation loss: 0.237886

The validation and training losses are close. The test loss is slightly higher (about 0.04 above the training loss). Even though this difference is small, one could suspect overfitting. However, according to the loss curves presented in `Monitoring-the-loss` in the README.md, the loss on the validation set stayed very close to the loss on the training set throughout training, so the fitting graph shows no sign of overfitting. In addition, the final validation loss is close to the training loss. Since the validation set was never used to update the weights (no gradients are computed during evaluation), this indicates that the model generalizes well rather than overfitting the data.

If we look more closely at the results on the test set, we notice that the difference in total loss comes mainly from the classification loss, which is about 0.04 higher than on the training set. This difference may therefore be due to the test set containing examples that are slightly harder to classify.
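
The breakdown of the gap between test and training loss can be sketched as follows. The values are copied by hand from the evaluation table, and the `losses`, `gap` and `share` names are hypothetical.

```python
import pandas as pd

# Loss values for the test and training runs, copied from the table.
losses = pd.DataFrame(
    {
        "Loss/classification_loss": [0.223413, 0.18102],
        "Loss/localization_loss": [0.003314, 0.003023],
        "Loss/regularization_loss": [0.037377, 0.037377],
        "Loss/total_loss": [0.264104, 0.221419],
    },
    index=["test", "training"],
)

# Difference between test and training for each loss term.
gap = losses.loc["test"] - losses.loc["training"]

# Fraction of the total-loss gap carried by each term.
share = gap / gap["Loss/total_loss"]

print(gap.round(6))
print(share.round(3))
```

On these numbers the classification loss carries nearly all of the gap in total loss, the localization loss contributes a very small amount, and the regularization loss contributes nothing.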