# Results Analysis

## Metric Choices
- [Google-BLEU (GLEU)](https://web.science.mq.edu.au/~rdale/publications/papers/2007/gleu4ps2pdf.pdf): An alternative to BLEU that often aligns better with human judgements on MT tasks. GLEU computes precision and recall over all 1- to 4-grams and takes the minimum of the two.
- [Character n-gram F-score (CHRF)](http://www.statmt.org/wmt15/pdf/WMT49.pdf): 
$$ (1 + \beta^2) \frac{CHRP \times CHRR}{\beta^2 CHRP + CHRR} $$
where $CHRP$ is the percentage of character n-grams in the predicted sequence that also appear in the target sequence (precision) and $CHRR$ is the percentage of character n-grams in the target sequence that also appear in the predicted sequence (recall).
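- [BiLingual Evaluation Understudy (BLEU)](https://www.aclweb.org/anthology/P02-1040.pdf): Modified n-gram precision of the predicted sequence against the target, multiplied by a brevity penalty for predictions shorter than the target. The scores here use `weights=(1,0,0,0)`, i.e. unigram precision only.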
- Formality: The mean predicted probability that an output sequence is formal. It is computed by a neural network trained on a separate labelled informal/formal corpus; the reported value is the average softmax output for the formal class. This model was trained to 83% accuracy.
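
The two overlap metrics above can be illustrated in plain Python. This is a minimal sketch for a single prediction/target pair, not the NLTK implementation used in this notebook: the helper names (`ngram_counts`, `overlap`, `gleu`, `chrf`) are hypothetical, and `chrf` is simplified to one n-gram order with whitespace treated as an ordinary character, both of which are assumptions.

```python
from collections import Counter


def ngram_counts(seq, n):
    """Count the n-grams of a sequence (a string yields character n-grams)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def overlap(pred, target, orders):
    """Return (matches, n-grams in pred, n-grams in target) summed over the given orders."""
    matches = pred_total = target_total = 0
    for n in orders:
        p, t = ngram_counts(pred, n), ngram_counts(target, n)
        matches += sum((p & t).values())  # clipped matches: min count per n-gram
        pred_total += sum(p.values())
        target_total += sum(t.values())
    return matches, pred_total, target_total


def gleu(pred_tokens, target_tokens):
    """Minimum of precision and recall over all 1- to 4-grams."""
    m, p, t = overlap(pred_tokens, target_tokens, range(1, 5))
    if p == 0 or t == 0:
        return 0.0
    return min(m / p, m / t)


def chrf(pred, target, n=3, beta=1.0):
    """F-beta score of character n-gram precision (CHRP) and recall (CHRR), single order n."""
    m, p, t = overlap(pred, target, [n])
    if m == 0:
        return 0.0
    chrp, chrr = m / p, m / t
    return (1 + beta ** 2) * chrp * chrr / (beta ** 2 * chrp + chrr)


# The prediction is a prefix of the target: precision is 1.0, recall is 6/10.
print(gleu("the cat sat".split(), "the cat sat down".split()))  # 0.6
print(chrf("abcd", "abce", n=2))
```

With `beta > 1` the score weights recall more heavily than precision, which is the usual reason for choosing a larger `beta` in CHRF.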

In [1]:
from metrics.formality_classifier import FormalityClassifier

In [25]:
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from nltk.translate.chrf_score import corpus_chrf
from nltk.translate.gleu_score import corpus_gleu
import more_itertools as mit
import pandas as pd
import numpy as np
import seaborn as sns

In [3]:
formality_classifier = FormalityClassifier()

In [20]:
def load_and_score(results_file_path, actual_file_path, num_groups=8, val=False):
    # load data
    with open(actual_file_path) as f:
        actual = f.read()
    with open(results_file_path) as f:
        results = f.read()

    replace = lambda seq: seq.replace('<start>', '').replace('<end>', '')

    actual = [replace(seq) for seq in actual.split('\n')]
    results = [replace(seq) for seq in results.split('\n')]

    if val:
        actual = actual[:2000]
        results = results[:2000]

    # split data into test groups
    split_size = len(actual) // num_groups
    actual_split = [actual[x:x+split_size] for x in range(0, len(actual), split_size)]
    results_split = [results[x:x+split_size] for x in range(0, len(results), split_size)]
    
    s = SmoothingFunction().method1

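    # score each group separately so per-group variation is visible in the returned frame
    # note: corpus_bleu and corpus_gleu are documented to take tokenised sentences
    # (lists of tokens); raw strings are passed here, so NLTK iterates over characters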
    formality, bleu, gleu, chrf = [], [], [], []
    for a, r in zip(actual_split, results_split):
        formality.append(formality_classifier.classify(r))
        bleu.append(corpus_bleu(a, r, weights=(1,0,0,0)))
        chrf.append(corpus_chrf(a, r))
        gleu.append(corpus_gleu(a, r))

    df = pd.DataFrame(list(zip(bleu, gleu, chrf, formality)),
                      columns=['BLEU', 'GLEU', 'CHRF', 'FORMALITY'])

    print('BLEU: {:4f} | CHRF: {:4f} | FORMALITY: {:4f} | GLEU: {:4f}'.format(np.mean(bleu), 
                                                                              np.mean(chrf), 
                                                                              np.mean(formality),
                                                                              np.mean(gleu)))
    return df
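
NLTK documents `corpus_bleu` and `corpus_gleu` as taking tokenised input: one list of reference token lists per hypothesis, and one token list per hypothesis. A minimal sketch of preparing that shape is shown here. The helper name `to_nltk_inputs` and the whitespace tokenisation are assumptions, not part of the notebook.

```python
def to_nltk_inputs(actual, results):
    """Tokenise on whitespace and wrap each reference in its own list.

    actual, results: lists of sentence strings, aligned by position.
    Returns (references, hypotheses) where references[i] is a list holding
    the single reference token list for hypotheses[i].
    """
    references = [[seq.split()] for seq in actual]
    hypotheses = [seq.split() for seq in results]
    return references, hypotheses


refs, hyps = to_nltk_inputs(
    ["I cannot remember who sings it ."],
    ["i can not remember who sings the song"],
)
print(refs)
print(hyps)

# The pair could then be passed on, for example:
#   corpus_bleu(refs, hyps, weights=(1, 0, 0, 0),
#               smoothing_function=SmoothingFunction().method1)
#   corpus_gleu(refs, hyps)
```

Scores computed on word tokens are not comparable with scores computed on raw strings, so any comparison between models should use one input shape throughout.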

In [5]:
BASE_PATH = 'Data/Results/'
actual = 'Data/Supervised Data/Entertainment_Music/S_Formal_EM_ValTest.txt'

## Results from GYAFC Paper

In [6]:
gyafc_results = 'Data/GYAFC_Corpus/Entertainment_Music/model_outputs/formal.nmt_baseline'
gyafc_actual = 'Data/GYAFC_Corpus/Entertainment_Music/test/formal.ref0'

In [7]:
gyafc_df = load_and_score(gyafc_results, gyafc_actual)

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


BLEU: 0.355480 | CHRF: 0.499192 | FORMALITY: 0.676480 | GLEU: 0.005461


## Vanilla Encoder Decoder Custom
The vanilla encoder-decoder feeds each sequence into the encoder to learn a latent representation. The decoder then iterates through the original sequence and uses the latent representation to predict the next word. This model was trained for 30 epochs on 25,0000 sequences. 

In [8]:
ved_df = load_and_score(BASE_PATH + 'vanilla_encoder_decoder_results_custom.txt', actual)

BLEU: 0.454090 | CHRF: 0.041665 | FORMALITY: 0.980703 | GLEU: 0.017241


## Custom Transformer Results

In [9]:
ct_df = load_and_score(BASE_PATH + 'Custom_Transformer_Results.txt', actual)

ValueError: in user code:

    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:1147 predict_function  *
        outputs = self.distribute_strategy.run(
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:1122 predict_step  **
        return self(x, training=False)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:927 __call__
        outputs = call_fn(cast_inputs, *args, **kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/sequential.py:277 call
        return super(Sequential, self).call(inputs, training=training, mask=mask)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py:719 call
        convert_kwargs_to_constants=base_layer_utils.call_context().saving)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py:888 _run_internal_graph
        output_tensors = layer(computed_tensors, **kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/wrappers.py:531 __call__
        return super(Bidirectional, self).__call__(inputs, **kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:927 __call__
        outputs = call_fn(cast_inputs, *args, **kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/wrappers.py:645 call
        initial_state=forward_state, **kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/recurrent.py:654 __call__
        return super(RNN, self).__call__(inputs, **kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:927 __call__
        outputs = call_fn(cast_inputs, *args, **kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/recurrent_v2.py:1187 call
        runtime) = lstm_with_backend_selection(**normal_lstm_kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/recurrent_v2.py:1566 lstm_with_backend_selection
        **params)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py:2419 __call__
        graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py:2777 _maybe_define_function
        graph_function = self._create_graph_function(args, kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py:2667 _create_graph_function
        capture_by_value=self._capture_by_value),
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py:981 func_graph_from_py_func
        func_outputs = python_func(*func_args, **func_kwargs)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/recurrent_v2.py:1320 standard_lstm
        zero_output_for_mask=zero_output_for_mask)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/backend.py:4088 rnn
        [inp[0] for inp in flatted_inputs])
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/backend.py:4088 <listcomp>
        [inp[0] for inp in flatted_inputs])
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:984 _slice_helper
        name=name)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:1150 strided_slice
        shrink_axis_mask=shrink_axis_mask)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py:10179 strided_slice
        shrink_axis_mask=shrink_axis_mask, name=name)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:744 _apply_op_helper
        attrs=attr_protos, op_def=op_def)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py:595 _create_op_internal
        compute_device)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:3327 _create_op_internal
        op_def=op_def)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:1817 __init__
        control_input_ops, op_def)
    /home/sean/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:1657 _create_c_op
        raise ValueError(str(e))

    ValueError: slice index 0 of dimension 0 out of bounds. for '{{node strided_slice_1}} = StridedSlice[Index=DT_INT32, T=DT_FLOAT, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1](transpose, strided_slice_1/stack, strided_slice_1/stack_1, strided_slice_1/stack_2)' with input shapes: [0,?,200], [1], [1], [1] and with computed input tensors: input[1] = <0>, input[2] = <1>, input[3] = <1>.


## Bahdanau Attention
```
Informal:  <start> pretty woman but i cant remember who sings it .  <end>
Formal:  <start> The song is called Pretty Woman , but I cannot remember who sings it .  <end>
Predicted:  <start> i believe she is a woman i can not remember who sings the song <end> 
```

In [10]:
ba_df = load_and_score(BASE_PATH + 'Bahdanau_Attention_Results_Custom.txt', actual)

BLEU: 0.235396 | CHRF: 0.270998 | FORMALITY: 0.696809 | GLEU: 0.004408


## ONMT Transformer
The ONMT transformer was trained on the first 2000 sequences of the test set, and the remaining sequences were used as validation.

In [21]:
onmt_T_df = load_and_score(BASE_PATH + 'onmt_transformer_output.txt', actual, val=True)

BLEU: 0.305277 | CHRF: 0.348424 | FORMALITY: 0.644124 | GLEU: 0.004962


## CRF POS Model
The CRF POS model was a sequence-to-sequence model trained using [parallel encodings](https://arxiv.org/pdf/1804.09849.pdf).

In [23]:
crf_pos_df = load_and_score(BASE_PATH + 'crf_pos_seq2seq_predictions.txt', actual, val=True)

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


BLEU: 0.312281 | CHRF: 0.075835 | FORMALITY: 0.867475 | GLEU: 0.008274


In [24]:
crf_pos_df

| Group | BLEU | GLEU | CHRF | FORMALITY |
|---|---|---|---|---|
| 0 | 0.313285 | 0.008327 | 0.071902 | 0.870863 |
| 1 | 0.316806 | 0.008575 | 0.074554 | 0.884183 |
| 2 | 0.314692 | 0.008402 | 0.070211 | 0.870336 |
| 3 | 0.306919 | 0.008037 | 0.078273 | 0.862484 |
| 4 | 0.31059 | 0.008178 | 0.080606 | 0.863799 |
| 5 | 0.319035 | 0.00838 | 0.079612 | 0.866937 |
| 6 | 0.306343 | 0.008154 | 0.077076 | 0.86379 |
| 7 | 0.310576 | 0.008137 | 0.074447 | 0.857407 |
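
The per-group scores can be summarised with a mean and a sample standard deviation to show how much each metric varies between groups. The sketch below uses only the standard library, with the values copied from the CRF POS output above. The `summarise` helper is a hypothetical name.

```python
from statistics import mean, stdev

# Per-group scores copied from the crf_pos_df output.
scores = {
    "BLEU": [0.313285, 0.316806, 0.314692, 0.306919,
             0.31059, 0.319035, 0.306343, 0.310576],
    "GLEU": [0.008327, 0.008575, 0.008402, 0.008037,
             0.008178, 0.00838, 0.008154, 0.008137],
    "CHRF": [0.071902, 0.074554, 0.070211, 0.078273,
             0.080606, 0.079612, 0.077076, 0.074447],
    "FORMALITY": [0.870863, 0.884183, 0.870336, 0.862484,
                  0.863799, 0.866937, 0.86379, 0.857407],
}


def summarise(columns):
    """Map each metric name to (mean, sample standard deviation)."""
    return {name: (mean(vals), stdev(vals)) for name, vals in columns.items()}


summary = summarise(scores)
for name, (m, s) in summary.items():
    print(f"{name}: mean={m:.6f} sd={s:.6f}")
```

The means reproduce the summary line printed by `load_and_score` for this model. On the DataFrame itself, `crf_pos_df.agg(['mean', 'std'])` should give the same figures.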


## Transformer with Rules

In [31]:
rule_trans = load_and_score(BASE_PATH + 'rule_based_transformer.txt', actual, val=True)

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


BLEU: 0.324278 | CHRF: 0.127470 | FORMALITY: 0.783835 | GLEU: 0.006438
