## Qualitative Visualization

Instead of relying solely on the BLEU score, we can load the generated prediction file and map each prediction back to its test example.

We randomly visualize 200 examples.

In [1]:
src_test = []
tgt_test = []

with open('/home/anie/OpenNMT-py/data/src-test.txt', 'r') as f:
    for line in f:
        src_test.append(line.strip())

with open('/home/anie/OpenNMT-py/data/tgt-test.txt', 'r') as f:
    for line in f:
        tgt_test.append(line.strip())

In [2]:
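def load_model(path):
    # Despite the name, this reads a file of generated predictions
    # (one per line, in test-set order), not a model checkpoint.
    preds = []
    with open(path, 'r') as f:
        for line in f:
            preds.append(line.strip())
    return preds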

In [3]:
best_model_tgt = load_model('/home/anie/OpenNMT-py/save/because_transformer_sep5/dissent_step_80000_pred.txt')
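
Mapping predictions back to the test set relies on line `i` of the prediction file corresponding to line `i` of the source and target test files. A minimal sketch of a sanity check under that assumption follows; `align_predictions` is a hypothetical helper introduced here for illustration and is not part of OpenNMT.

```python
def align_predictions(src, gold, pred):
    """Zip sources, gold targets, and predictions after checking they line up.

    Assumes one example per line in all three lists (hypothetical helper).
    """
    if not (len(src) == len(gold) == len(pred)):
        raise ValueError(
            "length mismatch: src={}, gold={}, pred={}".format(
                len(src), len(gold), len(pred)
            )
        )
    return list(zip(src, gold, pred))
```

In the notebook this could be called as `align_predictions(src_test, tgt_test, best_model_tgt)`; a `ValueError` would suggest that the prediction file was produced from a different test split.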

In [12]:
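def display(best_model_tgt, i=0, sample=100):
    # Print `sample` consecutive test examples starting at index `i`.
    for idx in range(i, i+sample):
        # [:-1] drops the final character of the source (the sentence-final
        # period) so that the 'because' marker can be appended.
        print("gold src: {}".format(src_test[idx][:-1] + 'because'))
        print("gold tgt: {}".format(tgt_test[idx]))
        print("gen tgt: {}".format(best_model_tgt[idx]))
        print("------------")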
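
The `display` helper prints a contiguous slice of the test set starting at `i`. To inspect a random sample instead, one option is to draw indices first and print only those. This is a sketch: `sample_indices` and `display_sample` are hypothetical names, and the fixed seed is an assumption made for reproducibility.

```python
import random

def sample_indices(n_examples, k=200, seed=0):
    """Return up to k distinct indices drawn uniformly from range(n_examples)."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    return sorted(rng.sample(range(n_examples), min(k, n_examples)))

def display_sample(src, gold, pred, indices):
    """Print the same fields as `display`, but for arbitrary indices."""
    for idx in indices:
        print("gold src: {}".format(src[idx][:-1] + 'because'))
        print("gold tgt: {}".format(gold[idx]))
        print("gen tgt: {}".format(pred[idx]))
        print("------------")
```

A call such as `display_sample(src_test, tgt_test, best_model_tgt, sample_indices(len(src_test)))` would then show 200 randomly chosen examples.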

In [13]:
display(best_model_tgt)

gold src: The stamp would undercut post office revenue in the future because
gold tgt: It would no longer get a financial gain when it raised rates .
gen tgt: It would be difficult to attract new customers .
------------
gold src: What , there s no such thing as failure because
gold tgt: For me the most important thing in life is family .
gen tgt: It is nt .
------------
gold src: Many have resisted buying LEDs because
gold tgt: They think they have to replace all their lights with kits for about $ 50 a pop .
gen tgt: They are not sure how much they will cost .
------------
gold src: Essentially , Wachovia was forced to write down the value of these assets because
gold tgt: They were considered overvalued compared with the market value -- or what Wells Fargo was willing to pay .
gen tgt: It was unable to sell them .
------------
gold src: A beautician is refusing to move her car despite racking up nearly 300 in tickets because
gold tgt: She claims a ' bullying ' council turned up unann

In [22]:
transformer_lm_books_tgt = load_model('/home/anie/OpenNMT-py/save/openai-transformer/tgt-test-freeze.txt')

In [24]:
display(transformer_lm_books_tgt, sample=20)

gold src: The stamp would undercut post office revenue in the future because
gold tgt: It would no longer get a financial gain when it raised rates .
gen tgt: of the fact that the government had been forced to unk pay for the stamp
------------
gold src: What , there s no such thing as failure because
gold tgt: For me the most important thing in life is family .
gen tgt: you have to
------------
gold src: Many have resisted buying LEDs because
gold tgt: They think they have to replace all their lights with kits for about $ 50 a pop .
gen tgt: they
------------
gold src: Essentially , Wachovia was forced to write down the value of these assets because
gold tgt: They were considered overvalued compared with the market value -- or what Wells Fargo was willing to pay .
gen tgt: they ere not in the market for the kind of money that the government as offering
------------
gold src: A beautician is refusing to move her car despite racking up nearly 300 in tickets because
gold tgt: She claims 

For some sentences, the model has clearly generated a good response. But where did it learn to generate them? We query the training data, using edit distance to find the training examples closest to a given test example, and inspect what they contain.

Certain gold test target sentences are too specific.

## Analysis

We load the training data and find the closest matches to an interesting test sentence, ranking training sources by normalized edit distance. The same could be done with distances between encoded vectors, but that requires some wrangling with the OpenNMT code.
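
To make the metric concrete, here is a dependency-free sketch of a normalized edit distance over arbitrary sequences. It is not the `editdistance` implementation, only a plain dynamic-programming version intended to behave the same way on small inputs. The two example sentences are shortened, illustrative strings rather than actual dataset lines.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length, so the result is in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

# Lists are compared token by token, strings character by character.
print(levenshtein(["this"], ["that"]))  # 1
print(levenshtein("this", "that"))      # 2

test_src = "That banned him from running for president .".split()
train_src = "That disqualified her from running for mayor .".split()
print(normalized_levenshtein(test_src, train_src))  # 0.375
```

Splitting on whitespace first means a single changed word costs one edit regardless of its length, which is why the notebook tokenizes sentences before comparing them.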

In [14]:
src_train = []
tgt_train = []

with open('/home/anie/OpenNMT-py/data/src-train.txt', 'r') as f:
    for line in f:
        src_train.append(line.strip())

with open('/home/anie/OpenNMT-py/data/tgt-train.txt', 'r') as f:
    for line in f:
        tgt_train.append(line.strip())

In [15]:
print(len(src_train))

1992028


In [16]:
import editdistance

In [21]:
editdistance.eval(["this"], ['that'])

1

In [49]:
import numpy as np

def get_normalized_leven(a, b):
    # Token-level edit distance scaled to [0, 1] by the longer sequence length.
    return editdistance.eval(a, b) / max(len(a), len(b))

def display_with_training_point(best_model_tgt, i=0, closest=5):
    closest_dis = np.inf
    candidate_sents = []
    
    sentA = src_test[i].split()  # tokenized test source
    for idx in range(len(src_train)):
        cur_dis = get_normalized_leven(sentA, src_train[idx].split())
        # Record every training pair that sets a new best distance, keeping
        # only the last `closest` of them. These are successive improvements
        # in scan order, not necessarily the `closest` nearest pairs overall.
        if cur_dis < closest_dis:
            if len(candidate_sents) >= closest:
                candidate_sents.pop(0)  # drop the oldest candidate
            candidate_sents.append((src_train[idx], tgt_train[idx]))
            closest_dis = cur_dis
    
    print("gold src: {}".format(src_test[i][:-1] + 'because'))
    print("gold tgt: {}".format(tgt_test[i]))
    print("gen tgt: {}".format(best_model_tgt[i]))
    print("------------")
    print("closest training pairs: ")
    for c_s in candidate_sents:
        print("gold src: {}".format(c_s[0]))
        print("gold tgt: {}".format(c_s[1]))
        print("===========")
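
Because `display_with_training_point` only records a pair when it beats the best distance seen so far, the printed candidates are the last few successive improvements, not necessarily the `closest` nearest training sources. A sketch of a true top-k search using `heapq.nsmallest` is given below. To keep it self-contained, a `difflib`-based dissimilarity stands in for the normalized edit distance; `token_dissimilarity` and `k_closest_pairs` are hypothetical names, and in the notebook `get_normalized_leven` could be passed as `distance` instead.

```python
import heapq
from difflib import SequenceMatcher

def token_dissimilarity(a, b):
    """Stand-in distance: 0.0 for identical token lists, 1.0 for no overlap."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()

def k_closest_pairs(query, src_sents, tgt_sents, k=5, distance=token_dissimilarity):
    """Return the k (distance, source, target) triples nearest to `query`."""
    query_tokens = query.split()
    # Generator of (distance, index): nsmallest keeps only k items in memory,
    # and ties on distance fall back to the smaller index.
    scored = (
        (distance(query_tokens, src.split()), idx)
        for idx, src in enumerate(src_sents)
    )
    best = heapq.nsmallest(k, scored)
    return [(d, src_sents[idx], tgt_sents[idx]) for d, idx in best]
```

For example, `k_closest_pairs(src_test[38], src_train, tgt_train, k=4, distance=get_normalized_leven)` would return the four nearest training pairs sorted from closest to farthest, at the same cost of one full pass over the training set.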

In [34]:
best_model_tgt.index("He was born in Burkina Faso .")

38

In [61]:
display_with_training_point(best_model_tgt, i=38, closest=4)

gold src: That banned his most threatening challenger , Rally leader Alassane Ouattara , from running for president because
gold tgt: He is only half - Ivorian .
gen tgt: He was born in Burkina Faso .
------------
closest training pairs: 
gold src: That 's hard to say , you know , for work .
gold tgt: I work alongside guys who wear skintight leather leotards .
gold src: And the most popular politician , opposition leader Aung San Suu Kyi , is barred from running for presidency .
gold tgt: Her late husband and sons are foreign citizens .
gold src: That , authorities say , disqualified her from running for mayor .
gold tgt: She was n ' t a full - time resident .
gold src: That banned his most threatening challenger _ Rally leader Alassane Ouattara _ from running for president .
gold tgt: He is only half - Ivorian .


In [56]:
def search_keywords(keyword):
    obtained_results = []
    for idx in range(len(src_train)):
        if keyword in src_train[idx]:
            obtained_results.append((src_train[idx], tgt_train[idx]))
            
    for c_s in obtained_results:
        print("gold src: {}".format(c_s[0]))
        print("gold tgt: {}".format(c_s[1]))
        print("===========")
    
    #return obtained_results

In [57]:
search_keywords('Burkina Faso')

gold tgt: The government there has n ' t been able to restore order .
gold src: Selling Augusti products in Burkina Faso would be `` complicated , '' says Claudie Ramirez .
gold tgt: The price - a minimum of 18 euros -LRB- 14 ; $ 25 -RRB- - is far higher than most Burkinabe women are used to paying .
gold src: Burkina Faso must not be underestimated .
gold tgt: Just they are an unknown quantity to us .
gold src: Burkina Faso is a target of terrorism , according to Sean Smith , senior Africa analyst at Verisk Maplecroft , a global risk consultancy firm .
gold tgt: It contributes more troops than any other West African nation to the peacekeeping mission in Mali .
gold src: Nigeria 's friendly against Burkina Faso in London on Monday has been called off .
gold tgt: Seven of the Burkinabe players were unable to secure visas to Britain .
gold src: Thousands of women are dying every year during pregnancy and childbirth in the African state of Burkina Faso .
gold tgt: Discrimination stops the

In [58]:
search_keywords("Alassane Ouattara")

gold src: Notably , former prime minister Alassane Ouattara was barred from running in the last presidential election .
gold tgt: One of his parents was from Burkina Faso , Ivory Coast 's northern neighbor .
gold src: His rival Alassane Ouattara could nt stand for president .
gold tgt: His mother was nt Ivorian .
gold src: The opposition politician Alassane Ouattara was excluded .
gold tgt: He had foreign parents , from running in a presidential election in 2000 , despite having previously served as prime minister .
gold src: That banned his most threatening challenger _ Rally leader Alassane Ouattara _ from running for president .
gold tgt: He is only half - Ivorian .
gold src: That banned his most threatening challenger -- Rally leader Alassane Ouattara -- from running for president .
gold tgt: He is only half - Ivorian .
gold src: A key rival of Gbagbo , former prime minister and Muslim northerner Alassane Ouattara , was barred from running in the country 's last presidential electi

In [64]:
search_keywords("Rugby")

gold src: Rugby sevens has a good case to make .
gold tgt: It 's great to watch and very competitive .
gold src: I was keen to have a crack at Super Rugby .
gold tgt: It 's the best competition in the world and it 's going to get better next year with the Rebels coming on board .
gold src: The Australian Rugby Union needs to make the tough decision to cut a local Super Rugby team .
gold tgt: The current five - team model is not financially sustainable .
gold src: We have to change the kit for the Rugby World Cup .
gold tgt: There 's no sponsorship allowed .
gold src: How the manager and his employers at the Rugby Football Union must be praying that this proves to be true .
gold tgt: Johnson 's home record against major southern hemisphere opposition is Godawful .
gold src: All Blacks wing Jonah Lomu , struggling with a chronic kidney illness , cut his ties with the New Zealand Rugby Union , he has told a television current affairs show .
gold tgt: They did n ' t appreciate his value .
