# Evaluation

In [4]:
import json

from pandas import concat, set_option, DataFrame

from data import file
from reporting.evaluation import plot_confusion_matrix, percentage_true_positives

## Load Prediction Data
For each model I stored the predictions for 100 samples so that confusion matrices, metrics and comparisons
can be generated here. The stored data is under [version control](../../data/processed/) and has the
following format:

```
{
  "expected": ["International", "Etat", ...],
  "predicted": ["International", "Etat", ...]
}
```



In [5]:
def load_report(path):
    """Return the (expected, predicted) label lists stored in a report file."""
    with open(path, 'r') as f:
        raw = json.load(f)
    return raw['expected'], raw['predicted']


# One report file per model, all sharing the format described above.
tfidf_expected, tfidf_predicted = load_report(file.reporting_data_report_tfidf)
cnn_expected, cnn_predicted = load_report(file.reporting_data_report_cnn)
rnn_expected, rnn_predicted = load_report(file.reporting_data_report_rnn)
bert_expected, bert_predicted = load_report(file.reporting_data_report_bert)

## Result
During model development and training I mainly focused on recall (sensitivity / true positive rate). The obvious
reason is that it is the metric I am most used to, and I was not aware that choosing a metric is a decision that
should be made actively and upfront. On the other hand, recall feels like an obvious choice considering the type of
problem to solve.
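
The helper `percentage_true_positives` used below lives in `reporting.evaluation` and its implementation is not shown in this notebook. As a rough illustration of what per-category recall means, here is a minimal sketch in plain Python. The function name `recall_per_class` and the toy labels are made up for this example and are not part of the project code.

```python
from collections import Counter


def recall_per_class(expected, predicted):
    """Return {label: recall in percent} for every label that occurs in `expected`."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    # Number of samples per true label.
    totals = Counter(expected)
    # Number of samples per true label that were predicted correctly.
    hits = Counter(e for e, p in zip(expected, predicted) if e == p)
    return {label: 100.0 * hits[label] / totals[label] for label in sorted(totals)}


# Toy data, invented for illustration only.
expected = ["Sport", "Sport", "Web", "Web", "Web", "Etat"]
predicted = ["Sport", "Web", "Web", "Web", "Etat", "Etat"]

recalls = recall_per_class(expected, predicted)
# Unweighted mean over categories, comparable to the "Mean" row of the table.
macro_recall = sum(recalls.values()) / len(recalls)
```

In this toy case one of two "Sport" samples and two of three "Web" samples are recovered, so small categories influence the unweighted mean as much as large ones.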

The following table shows a comparison of the per-category recall (in percent) for the models that I investigated:

In [6]:
tp_tfidf = percentage_true_positives(tfidf_predicted, tfidf_expected, column='TF-IDF')
tp_cnn = percentage_true_positives(cnn_predicted, cnn_expected, column='CNN')
tp_rnn = percentage_true_positives(rnn_predicted, rnn_expected, column='RNN')
tp_bert = percentage_true_positives(bert_predicted, bert_expected, column='BERT')
set_option('display.max_rows', 500)
set_option('display.max_columns', 500)
set_option('display.width', 1000)
set_option('display.expand_frame_repr', True)

table = concat([tp_tfidf, tp_cnn, tp_rnn, tp_bert], axis=1)
table.loc['Mean'] = table.mean()
table

| Category | TF-IDF | CNN | RNN | BERT |
|---|---|---|---|---|
| Etat | 66.667 | 83.333 | 83.333 | 66.667 |
| Inland | 75.0 | 37.5 | 50.0 | 100.0 |
| International | 100.0 | 50.0 | 70.588 | 100.0 |
| Kultur | 100.0 | 40.0 | 20.0 | 100.0 |
| Panorama | 79.167 | 75.0 | 57.143 | 79.167 |
| Sport | 90.909 | 72.727 | 87.5 | 90.909 |
| Web | 92.308 | 69.231 | 68.182 | 84.615 |
| Wirtschaft | 84.615 | 46.154 | 61.538 | 84.615 |
| Wissenschaft | 100.0 | 60.0 | 0.0 | 80.0 |
| Mean | 87.629556 | 59.327222 | 55.364889 | 87.330333 |


### Base Model
For the base model predictions I used TF-IDF with preprocessed data. During preprocessing I generated raw, lemmatized
and stemmed tokens. Of these three variants the lemmatized input gave the best average recall, so I decided to use it.
With an average recall of around 87%, TF-IDF in my opinion already provided a very good base model performance.
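
A TF-IDF baseline of this kind could look roughly like the sketch below. The notebook does not state which classifier was trained on the TF-IDF features, so the logistic regression here is an assumption, and the tiny lemmatized corpus is invented purely to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy lemmatized documents and labels, invented for illustration only.
train_texts = [
    "verein gewinnen spiel liga",
    "trainer mannschaft tor spiel",
    "bank zins markt aktie",
    "unternehmen umsatz markt gewinn",
]
train_labels = ["Sport", "Sport", "Wirtschaft", "Wirtschaft"]

# Assumption: logistic regression as the classifier on top of the TF-IDF features.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(train_texts, train_labels)

# Predict categories for two unseen (already lemmatized) documents.
predictions = baseline.predict(["mannschaft spiel tor", "aktie markt zins"])
```

Wrapping vectorizer and classifier in one pipeline keeps the vocabulary fitted on the training data only, which matters when comparing the raw, lemmatized and stemmed variants on the same held-out samples.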

#### Libraries: sklearn, matplotlib, pandas, seaborn

### Neural Networks
While trying neural networks, the main questions to answer were the following:
- Is the out-of-the-box performance of a given neural network much better than that of the base model?
- How trainable is a given network? In other words, how much can I improve the performance by tuning?

### CNN
I built a CNN using pretrained fasttext word embeddings followed by a convolutional layer and a global max pooling layer.
The out-of-the-box average recall was around 59%, much worse than the base model.

#### Layers
1. Input Layer
2. Text Vectorization
3. Word Embedding (fasttext, pretrained)
4. Convolutional
5. Global Max Pooling
6. Dense (softmax)
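
The layer stack above can be sketched with Keras as follows. This is only an approximation of the notebook's model: it starts after text vectorization and takes integer token ids as input, the embedding is randomly initialised instead of loading the pretrained fasttext vectors, and the filter count, kernel size, vocabulary size and sequence length are placeholder values that the notebook does not specify.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


def build_cnn(vocab_size, embedding_dim, sequence_length, num_classes):
    """CNN text classifier operating on already vectorized token ids."""
    inputs = tf.keras.Input(shape=(sequence_length,), dtype="int32")
    # Placeholder embedding: randomly initialised here, whereas the notebook
    # used pretrained fasttext vectors.
    x = layers.Embedding(vocab_size, embedding_dim)(inputs)
    # Hypothetical filter count and kernel size.
    x = layers.Conv1D(filters=128, kernel_size=5, activation="relu")(x)
    # Keep the strongest activation of every filter across the sequence.
    x = layers.GlobalMaxPooling1D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


# Nine output classes, one per category in the recall table.
cnn = build_cnn(vocab_size=1000, embedding_dim=32, sequence_length=50, num_classes=9)
probabilities = cnn(np.zeros((2, 50), dtype="int32")).numpy()
```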

#### Libraries: tensorflow, sklearn, matplotlib, pandas, seaborn

#### Decision
I expected better performance from RNNs and LSTMs because they were the more traditional approaches to text
classification prior to 2018. That was the main reason not to spend much time on tuning the CNN.

### RNN
The RNN I built was based on a word embedding layer (fasttext) followed by a bidirectional LSTM layer containing 32 units.
The out-of-the-box average recall was at around 55%.

#### Layers
1. Input Layer
2. Text Vectorization
3. Word Embedding (fasttext, pretrained)
4. Bidirectional LSTM
5. Global Max Pooling
6. Dense (softmax)
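
A possible Keras version of this stack is sketched below, under the same caveats as any reconstruction: the input is assumed to be integer token ids (text vectorization happens beforehand), the embedding is randomly initialised rather than pretrained fasttext, and vocabulary size, embedding dimension and sequence length are placeholders. Only the 32 LSTM units are taken from the description above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


def build_rnn(vocab_size, embedding_dim, sequence_length, num_classes, lstm_units=32):
    """Bidirectional LSTM text classifier operating on vectorized token ids."""
    inputs = tf.keras.Input(shape=(sequence_length,), dtype="int32")
    # Placeholder for the pretrained fasttext embedding.
    x = layers.Embedding(vocab_size, embedding_dim)(inputs)
    # return_sequences=True keeps one output per time step, so the pooling
    # layer that follows has a time axis to reduce.
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    x = layers.GlobalMaxPooling1D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


rnn = build_rnn(vocab_size=1000, embedding_dim=32, sequence_length=50, num_classes=9)
probabilities = rnn(np.zeros((2, 50), dtype="int32")).numpy()
```

The pooling step after the LSTM only makes sense if the LSTM returns the full sequence, which is why `return_sequences=True` is set in this sketch.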

#### Libraries: tensorflow, sklearn, matplotlib, pandas, seaborn

#### Tuning
I tuned the model with the TensorBoard HParams plugin, which was one of the native TensorFlow approaches I found.
The tuning mainly covered the following parameters:
- number of LSTM layers
- number of LSTM units
- bidirectional / unidirectional
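
To give an idea of the size of such a search, the three parameters can be enumerated as a plain grid. The sketch below does not use the TensorBoard HParams API; it only lists the combinations in plain Python, and the candidate values are hypothetical because the notebook does not record which values were actually tried.

```python
from itertools import product

# Hypothetical candidate values for the three tuned parameters.
search_space = {
    "lstm_layers": [1, 2],
    "lstm_units": [16, 32, 64],
    "bidirectional": [True, False],
}


def iter_configs(space):
    """Yield one dict per combination of the values in `space`."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
```

Each resulting dict describes one training run, so the number of runs grows multiplicatively with every additional parameter value.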

#### Decision
The tuning was not very successful: the model performance (recall) trended downwards. Because of that I
decided to stop tuning and give transformers a try.

### BERT Transformer
To test transformer approaches I used a pretrained tokenizer and model (bert-base-german-cased) from
[huggingface](https://huggingface.co/bert-base-german-cased). The model was available for PyTorch and TensorFlow and I
decided to use TensorFlow. For the activation of the output layer I used softmax.
The out-of-the-box average recall was around 87%.


#### Layers
0. Pretrained Bert Tokenizer (bert-base-german-cased)
1. Input Layer
2. Pretrained Bert Model (bert-base-german-cased)
3. Dense (softmax)

#### Libraries: tensorflow, transformers, sklearn, matplotlib, pandas, seaborn


## Recommendation
TF-IDF and the BERT transformer have similar performance (average recall of around 87% each). Because of that I would
recommend one of the following options:

A) If an average recall of around 87% is acceptable, I would go with TF-IDF for the following reasons:
   - simpler, less resource intensive, faster, less expensive
   - TF-IDF is easier to explain than a neural network

B) If an average recall of around 87% is not good enough, I would tune the transformer for the following reasons:
   - I would expect the model to have more room for improvement than my current base model based on TF-IDF
   - Last but not least: for a startup I would definitely go with ML because it increases the value of the company ;)