# Evaluation

In [2]:
import json

from pandas import concat, set_option

from data import file
from reporting.evaluation import plot_confusion_matrix, percentage_true_positives

## Load Evaluation Data
For each model I generated and stored validation data that can be used here.
I decided not to store any precomputed statistics, only two series of expected and predicted labels:
```
{
    "expected": ["International", "Etat", ...],
    "predicted": ["International", "Etat", ...]
}
```

From these n predictions I can then compute any statistic I need here.
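
As a minimal sketch of why the two label series are enough, the snippet below derives an overall accuracy and confusion counts from them using only the standard library. The sample labels are made up for illustration and are not taken from the stored reports; the variable names are assumptions as well.

```python
from collections import Counter

# Hypothetical sample in the same shape as the stored JSON (not real results).
report = {
    "expected": ["International", "Etat", "Sport", "Etat"],
    "predicted": ["International", "Etat", "Etat", "Etat"],
}

# Pair every expected label with the label the model predicted for it.
pairs = list(zip(report["expected"], report["predicted"]))

# Share of predictions that match the expected label.
accuracy = sum(e == p for e, p in pairs) / len(pairs)

# Confusion counts keyed by (expected, predicted).
confusion = Counter(pairs)
```

Any other per-class metric (precision, recall, F1) can be read off the same `confusion` counts, which is the reason for storing raw labels instead of finished statistics.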

In [3]:
with open(file.reporting_data_report_tfidf, 'r') as f:
    tfidf_raw = json.load(f)
    tfidf_expected = tfidf_raw['expected']
    tfidf_predicted = tfidf_raw['predicted']

with open(file.reporting_data_report_cnn, 'r') as f:
    cnn_raw = json.load(f)
    cnn_expected = cnn_raw['expected']
    cnn_predicted = cnn_raw['predicted']

with open(file.reporting_data_report_rnn, 'r') as f:
    rnn_raw = json.load(f)
    rnn_expected = rnn_raw['expected']
    rnn_predicted = rnn_raw['predicted']

with open(file.reporting_data_report_bert, 'r') as f:
    bert_raw = json.load(f)
    bert_expected = bert_raw['expected']
    bert_predicted = bert_raw['predicted']

## Overview
During model development and training I mainly focused on recall. The obvious reason is that it is the
metric I am most used to; on the other hand, I still feel confident in this decision given the type of problem.

### Recall (Sensitivity / True Positive Rate)
The following table shows the recall, in percent, of each model for each class (label).

In [4]:
# Per-label recall in percent, one column per model
tp_tfidf = percentage_true_positives(tfidf_predicted, tfidf_expected, column='TF-IDF')
tp_cnn = percentage_true_positives(cnn_predicted, cnn_expected, column='CNN')
tp_rnn = percentage_true_positives(rnn_predicted, rnn_expected, column='RNN')
tp_bert = percentage_true_positives(bert_predicted, bert_expected, column='BERT')
# Widen the pandas display limits so the whole table is shown
set_option('display.max_rows', 500)
set_option('display.max_columns', 500)
set_option('display.width', 1000)
set_option('display.expand_frame_repr', True)
# Place the per-model columns side by side in one table
concat([tp_tfidf, tp_cnn, tp_rnn, tp_bert], axis=1)

| Label | TF-IDF | CNN | RNN | BERT |
|---|---|---|---|---|
| Etat | 66.667 | 83.333 | 83.333 | 66.667 |
| Inland | 75.0 | 37.5 | 50.0 | 100.0 |
| International | 100.0 | 50.0 | 70.588 | 100.0 |
| Kultur | 100.0 | 40.0 | 20.0 | 100.0 |
| Panorama | 79.167 | 75.0 | 57.143 | 79.167 |
| Sport | 90.909 | 72.727 | 87.5 | 90.909 |
| Web | 92.308 | 69.231 | 68.182 | 84.615 |
| Wirtschaft | 84.615 | 46.154 | 61.538 | 84.615 |
| Wissenschaft | 100.0 | 60.0 | 0.0 | 80.0 |
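
To make the numbers in this table easier to interpret, here is a minimal sketch of how per-label recall can be computed from the two label series. It assumes that `percentage_true_positives` calculates recall as TP / (TP + FN) × 100 for each label; the actual implementation in `reporting.evaluation` is not shown here, so the helper `recall_per_label` and the sample labels below are hypothetical.

```python
from collections import Counter


def recall_per_label(predicted, expected):
    """Return {label: recall in percent}, i.e. TP / (TP + FN) * 100.

    Hypothetical helper; the argument order mirrors the call above.
    """
    # Number of samples that truly belong to each label (TP + FN).
    support = Counter(expected)
    # Number of samples per label that were predicted correctly (TP).
    hits = Counter(e for e, p in zip(expected, predicted) if e == p)
    return {
        label: round(100 * hits[label] / count, 3)
        for label, count in sorted(support.items())
    }


# Made-up labels for illustration, not taken from the stored reports.
expected = ["Etat", "Etat", "Etat", "Sport", "Sport", "Web"]
predicted = ["Etat", "Etat", "Web", "Sport", "Sport", "Sport"]

recall = recall_per_label(predicted, expected)
```

Under this definition a value of 0.0 means that not a single sample of that label was recognised, and with only a few validation samples per label one misclassification shifts the percentage by a large step.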


TF-IDF as a baseline model already provided decent classification results, and it was quite hard to find an ML-based model that
performed better.