# Results Analysis

## Metric Choices
- [Google-BLEU (GLEU)](https://web.science.mq.edu.au/~rdale/publications/papers/2007/gleu4ps2pdf.pdf): An alternative to BLEU that often aligns better with human judgements on MT tasks. GLEU computes precision and recall over all 1- to 4-grams and takes the minimum of the two.
- [Character n-gram F-score (CHRF)](http://www.statmt.org/wmt15/pdf/WMT49.pdf): 
$$ (1 + \beta^2) \frac{CHRP \times CHRR}{\beta^2 CHRP + CHRR} $$
where $CHRP$ is the percentage of character n-grams in the predicted sequence that also appear in the target sequence (precision) and $CHRR$ is the percentage of character n-grams in the target sequence that also appear in the predicted sequence (recall).
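- [BiLingual Evaluation Understudy (BLEU)](https://www.aclweb.org/anthology/P02-1040.pdf): Geometric mean of the modified n-gram precisions (typically 1- to 4-grams) of the predicted sequence against the target sequence, multiplied by a brevity penalty for predictions that are shorter than the target.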
- Formality: The average predicted confidence that each sequence is formal. It is computed by a neural network trained on a separate labelled informal/formal corpus; the result is the average softmax prediction for the formal class. This model was trained to 83% accuracy.
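
The CHRF formula and the GLEU min-of-precision-and-recall rule above can be sketched in a few lines. This is a minimal illustration, not the NLTK implementation used later in this notebook: it scores a single sentence pair, uses a single character n-gram order for CHRF, and does not strip whitespace. The function names and the defaults `n=3` and `beta=3.0` are assumptions chosen for the example.

```python
from collections import Counter


def ngram_counts(items, n):
    """Count the n-grams of length n in a sequence (a string or a token list)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def chrf_single_order(reference, hypothesis, n=3, beta=3.0):
    """Character n-gram F-score for one n-gram order (illustrative only)."""
    ref = ngram_counts(reference, n)
    hyp = ngram_counts(hypothesis, n)
    overlap = sum((ref & hyp).values())  # clipped count of shared n-grams
    if overlap == 0:
        return 0.0
    chrp = overlap / sum(hyp.values())  # precision: share of predicted n-grams found in the target
    chrr = overlap / sum(ref.values())  # recall: share of target n-grams found in the prediction
    return (1 + beta ** 2) * chrp * chrr / (beta ** 2 * chrp + chrr)


def gleu_sentence(reference_tokens, hypothesis_tokens, max_n=4):
    """Minimum of n-gram precision and recall over all 1..max_n grams."""
    ref, hyp = Counter(), Counter()
    for n in range(1, max_n + 1):
        ref += ngram_counts(reference_tokens, n)
        hyp += ngram_counts(hypothesis_tokens, n)
    if not ref or not hyp:
        return 0.0
    overlap = sum((ref & hyp).values())
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return min(precision, recall)
```

For example, `gleu_sentence("the cat sat".split(), "the cat".split())` has perfect precision but recovers only half of the reference n-grams, so the minimum is the recall.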

In [1]:
from metrics.formality_classifier import FormalityClassifier

In [25]:
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from nltk.translate.chrf_score import corpus_chrf
from nltk.translate.gleu_score import corpus_gleu
import more_itertools as mit
import pandas as pd
import numpy as np
import seaborn as sns

In [3]:
formality_classifier = FormalityClassifier()
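
The internals of `FormalityClassifier` are not shown in this notebook. As a rough sketch of the "average softmax prediction for the formal class" described under Metric Choices, assuming the network emits two logits per sequence and that index 1 is the formal class (both are assumptions, as are the function names):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_formal_probability(batch_logits, formal_index=1):
    """Average the softmax probability of the (assumed) formal class over a batch."""
    formal_probs = [softmax(logits)[formal_index] for logits in batch_logits]
    return sum(formal_probs) / len(formal_probs)
```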

In [54]:
def load_and_score(results_file_path, actual_file_path, num_groups=8, val=False):
    def replace(seq):
        if seq is not None:
            return seq.replace('<start>', '').replace('<end>', '')
        else:
            return ' '
    # load data (the context managers close each file after reading)
    with open(actual_file_path) as f:
        actual = f.read()
    with open(results_file_path) as f:
        results = f.read()

    actual = [replace(seq) for seq in actual.split('\n')]
    results = [replace(seq) for seq in results.split('\n')]

    if val:
        actual = actual[:2000]
        results = results[:2000]


    # split data into test groups
    split_size = len(actual) // num_groups
    actual_split = [actual[x:x+split_size] for x in range(0, len(actual), split_size)]
    results_split = [results[x:x+split_size] for x in range(0, len(results), split_size)]
    
    s = SmoothingFunction().method5

    # score each test group separately
    formality, bleu, gleu, chrf = [], [], [], []
    for a, r in zip(actual_split, results_split):
        formality.append(formality_classifier.classify(r))
        bleu.append(corpus_bleu(a, r, smoothing_function=s))
        chrf.append(corpus_chrf(a, r))
        gleu.append(corpus_gleu(a, r))

    df = pd.DataFrame(list(zip(bleu, gleu, chrf, formality)),
                      columns=['BLEU', 'GLEU', 'CHRF', 'FORMALITY'])

    print('BLEU: {:4f} | CHRF: {:4f} | FORMALITY: {:4f} | GLEU: {:4f}'.format(np.mean(bleu), 
                                                                              np.mean(chrf), 
                                                                              np.mean(formality),
                                                                              np.mean(gleu)))
    return df
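
NLTK's `corpus_bleu` and `corpus_gleu` expect a list of reference lists, where every reference and every hypothesis is a list of tokens, whereas `load_and_score` passes raw strings. A minimal sketch of preparing tokenized inputs, assuming whitespace tokenization and a single reference per sentence (`to_corpus_inputs` is a hypothetical helper, not part of the original notebook):

```python
def to_corpus_inputs(actual, results):
    """Whitespace-tokenize parallel lists of sentences.

    Returns (list_of_references, hypotheses): one single-item list of
    reference token lists per sentence, and one token list per hypothesis.
    """
    list_of_references = [[reference.split()] for reference in actual]
    hypotheses = [hypothesis.split() for hypothesis in results]
    return list_of_references, hypotheses


refs, hyps = to_corpus_inputs(
    ["I cannot remember who sings it ."],
    ["i can not remember who sings the song"],
)
```

The returned pair could then be passed as `corpus_bleu(refs, hyps, smoothing_function=s)` or `corpus_gleu(refs, hyps)`. Scores computed this way would likely differ from the figures printed in this notebook, which come from the function as written.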

In [37]:
BASE_PATH = 'Data/Results/'
actual = 'Data/Supervised Data/Entertainment_Music/S_Formal_EM_ValTest.txt'

## Results from GYAFC Paper

In [38]:
gyafc_results = 'Data/GYAFC_Corpus/Entertainment_Music/model_outputs/formal.nmt_baseline'
gyafc_actual = 'Data/GYAFC_Corpus/Entertainment_Music/test/formal.ref0'

In [39]:
gyafc_df = load_and_score(gyafc_results, gyafc_actual)

BLEU: 0.109758 | CHRF: 0.499192 | FORMALITY: 0.676480 | GLEU: 0.005461


## Vanilla Encoder Decoder Custom
The vanilla encoder-decoder feeds each sequence into the encoder to learn a latent representation. The decoder then iterates through the original sequence and uses the latent representation to predict the next word. This model was trained for 30 epochs on 25,0000 sequences.
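
One way to read the decoding step described above is as a teacher-forced loop: at each position the decoder receives the previous word of the known sequence plus the latent state and predicts the next word. The sketch below assumes that reading; `encode` and `decode_step` are hypothetical callables standing in for the trained model, and the toy versions exist only so the loop can be run.

```python
def teacher_forced_decode(source_tokens, target_tokens, encode, decode_step):
    """Predict each next target token from the previous target token and the latent state."""
    state = encode(source_tokens)  # latent representation of the input sequence
    predictions = []
    for previous_token in target_tokens[:-1]:
        predicted_token, state = decode_step(previous_token, state)
        predictions.append(predicted_token)
    return predictions


# Toy stand-ins (hypothetical); a real model would use learned layers here.
def toy_encode(tokens):
    return {"length": len(tokens)}


def toy_decode_step(previous_token, state):
    # Echo the previous token in upper case and pass the state through unchanged.
    return previous_token.upper(), state


demo = teacher_forced_decode(
    ["hi"], ["<start>", "hello", "<end>"], toy_encode, toy_decode_step
)
```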

In [40]:
ved_df = load_and_score(BASE_PATH + 'vanilla_encoder_decoder_results_custom.txt', actual)

BLEU: 0.122410 | CHRF: 0.041665 | FORMALITY: 0.980703 | GLEU: 0.017241


## Bahdanau Attention
```
Informal:  <start> pretty woman but i cant remember who sings it .  <end>
Formal:  <start> The song is called Pretty Woman , but I cannot remember who sings it .  <end>
Predicted:  <start> i believe she is a woman i can not remember who sings the song <end> 
```

In [42]:
ba_df = load_and_score(BASE_PATH + 'Bahdanau_Attention_Results_Custom.txt', actual)

BLEU: 0.094351 | CHRF: 0.270998 | FORMALITY: 0.696809 | GLEU: 0.004408


## ONMT Transformer
The ONMT transformer was trained on the first 2000 sequences of the test set, and the remaining sequences were used as validation.

In [43]:
onmt_T_df = load_and_score(BASE_PATH + 'onmt_transformer_output.txt', actual, val=True)

BLEU: 0.103317 | CHRF: 0.348424 | FORMALITY: 0.644124 | GLEU: 0.004962


## CRF POS Model
The CRF POS model is a sequence-to-sequence model trained using [parallel encodings](https://arxiv.org/pdf/1804.09849.pdf).

In [44]:
crf_pos_df = load_and_score(BASE_PATH + 'crf_pos_seq2seq_predictions.txt', actual, val=True)

BLEU: 0.104216 | CHRF: 0.075835 | FORMALITY: 0.867475 | GLEU: 0.008274


## Transformer with Rules

In [45]:
rule_trans = load_and_score(BASE_PATH + 'rule_based_transformer.txt', actual, val=True)

BLEU: 0.105755 | CHRF: 0.127470 | FORMALITY: 0.783835 | GLEU: 0.006438


## CRF POS Concat with Global Attention

In [53]:
pos_concat = load_and_score(BASE_PATH + 'CRF_POS_Concat.txt', actual)

BLEU: 0.070510 | CHRF: 0.044980 | FORMALITY: 0.541870 | GLEU: 0.002499


## Rule Concat with Global Attention

In [56]:
rule_concat = load_and_score(BASE_PATH + 'Rule_Concat.txt', actual)

BLEU: 0.077328 | CHRF: 0.037900 | FORMALITY: 0.571993 | GLEU: 0.005349
