# Results Analysis

## Metric Choices
- [Google-BLEU (GLEU)](https://web.science.mq.edu.au/~rdale/publications/papers/2007/gleu4ps2pdf.pdf): Alternative to BLEU that often aligns better with human judgements on MT tasks. GLEU measures precision and recall of all 1-4 grams and choses the minimum of the two. 
- [Character n-gram F-score (CHRF)](http://www.statmt.org/wmt15/pdf/WMT49.pdf): 
$$ (1 + \beta^2) \frac{CHRP \times CHRR}{\beta^2 CHRP + CHHRR} $$
where $CHRP$ is the percentage of n-grams in the predicted sequence that are in the target sequence and $CHRR$ is the percentage of character n-grams in the predicted sequence that are also in the target sequence 
- [BiLingual Evaluation Understudy (BLEU)](https://www.aclweb.org/anthology/P02-1040.pdf):
- Formality: The average predicted confidence each sequence is formal. Computed by neural network trained on separate labelled informal/formal corpus. Result is average softmax prediction for formal output. This model was trained to 83% accuracy 

In [1]:
from metrics.formality_classifier import FormalityClassifier

In [2]:
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from nltk.translate.chrf_score import corpus_chrf
from nltk.translate.gleu_score import corpus_gleu
import pandas as pd
import numpy as np

In [3]:
formality_classifier = FormalityClassifier()

In [4]:
def load_and_score(actual_file_path, results_file_path, num_groups=8, val=False):
    # load data
    actual = open(actual_file_path).read()
    results = open(results_file_path).read()
    
    actual = [seq for seq in actual.split('\n')]
    results = [seq.replace('<start>', '').replace('<end>', '') for seq in results.split('\n')]
    
    if val:
        acutal = actual[:2000]
        resutls = results[:2000]
    
    # split data into test groups
    split_size = len(actual) // num_groups
    actual_split = [actual[x:x+split_size] for x in range(0, len(actual), split_size)]
    results_split = [results[x:x+split_size] for x in range(0, len(results), split_size)]
    
    # define smoothing funciton for BLEU
    # method 5 averages counts for n-1, n, n+1 grams
    s = SmoothingFunction().method5
    
    # loop through 
    formality, bleu, gleu, chrf = [], [], [], []
    for a, r in zip(actual_split, results_split):
        formality.append(formality_classifier.classify(a))
        bleu.append(corpus_bleu(a, r, weights=(1,0,0,0)))
        chrf.append(corpus_chrf(a, r))
        gleu.append(corpus_gleu(a, r))
    
    df = pd.DataFrame(list(zip(bleu, gleu, chrf, formality)),
                      columns=['BLEU', 'GLEU', 'CHRF', 'FORMALITY'])
    
    print('BLEU: {:4f} | CHRF: {:4f} | FORMALITY: {:4f} | GLEU: {:4f}'.format(np.mean(bleu), 
                                                                              np.mean(chrf), 
                                                                              np.mean(formality),
                                                                              np.mean(gleu)))
    return df

In [5]:
BASE_PATH = 'Data/Results/'
actual = 'Data/Supervised Data/Entertainment_Music/S_Formal_EM_ValTest.txt'

## Results from GYAFC Paper

In [6]:
gyafc_results = 'Data/GYAFC_Corpus/Entertainment_Music/model_outputs/formal.nmt_baseline'
gyafc_actual = 'Data/GYAFC_Corpus/Entertainment_Music/test/formal.ref0'

In [7]:
gyafc_df = load_and_score(gyafc_results, gyafc_actual)

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


BLEU: 0.306912 | CHRF: 0.563394 | FORMALITY: 0.676480 | GLEU: 0.004694


## Vanilla Encoder Decoder Custom
The vanilla encoder decoder feeds the sequences into encoder to learn a latent representation. The decoder then iterates through the original sequence and uses the latent representation to predict a next word. This model was trained for 30 epochs on 25,0000 sequences. 

In [8]:
ved_df = load_and_score(BASE_PATH + 'vanilla_encoder_decoder_results_custom.txt', actual)

BLEU: 0.193423 | CHRF: 0.121051 | FORMALITY: 0.995549 | GLEU: 0.004617


## Custom Transformer Results

In [9]:
ct_df = load_and_score(BASE_PATH + 'Custom_Transformer_Results.txt', actual)

BLEU: 0.065263 | CHRF: 0.065662 | FORMALITY: 0.558764 | GLEU: 0.004617


## Bahdanau Attention
```
Informal:  <start> pretty woman but i cant remember who sings it .  <end>
Formal:  <start> The song is called Pretty Woman , but I cannot remember who sings it .  <end>
Predicted:  <start> i believe she is a woman i can not remember who sings the song <end> 
```

In [10]:
ba_df = load_and_score(BASE_PATH + 'Bahdanau_Attention_Results_Custom.txt', actual)

BLEU: 0.258770 | CHRF: 0.243385 | FORMALITY: 0.698479 | GLEU: 0.004617


## ONMT Transformer
ONMT transformer was trained on the first 2000 sequences of the test set, and the remaining sequences were used as validation

In [11]:
onmt_T_df = load_and_score(BASE_PATH + 'onmt_transformer_output.txt', actual, val=True)

BLEU: 0.279566 | CHRF: 0.385392 | FORMALITY: 0.644124 | GLEU: 0.004531
