# Testing Stanza, the Stanford NLP Group's Python package

### Using the training data, I will run experiments with Stanza's models.

We will test English-German (en-de) and English-Chinese (en-zh).

In [67]:
import pandas as pd
import stanza

We will only work with the first 20 lines of each dataset for testing.

In [302]:
en_zh = pd.read_csv('input/english-chinese/en-zh/train.enzh.df.short.tsv', sep='\t', index_col='index')
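If a file has not already been truncated on disk, `pandas.read_csv` can limit how many rows are parsed through its `nrows` parameter. The following is a minimal sketch on a hypothetical in-memory TSV (the three-row `tsv` string is made up for illustration and is not the real training data):

```python
import io

import pandas as pd

# Hypothetical three-row TSV standing in for the real training file.
tsv = "index\toriginal\ttranslation\n0\ta\tx\n1\tb\ty\n2\tc\tz\n"

# nrows counts data rows only; the header line is not included in the count.
sample = pd.read_csv(io.StringIO(tsv), sep="\t", index_col="index", nrows=2)
print(sample.shape)
```

With the real file, the same call would take the file path in place of the `io.StringIO` object and `nrows=20`.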

In [303]:
en_zh.head()

index,original,translation,scores,mean,z_scores,z_mean,model_scores
0,The last conquistador then rides on with his s...,最后的征服者骑着他的剑继续前进.,"[34, 38, 42]",38.0,"[-1.6082034156434533, -1.7847572356764654, -1....",-1.514119,-0.699063
1,He shoves Owen into the pit where Digger rips ...,他把欧文扔进了挖掘机挖出儿子心脏的坑里.,"[51, 49, 50, 70, 60, 50]",55.0,"[-0.9117078937258674, -1.6671266171635248, -0....",-0.600861,-0.627344
2,Alpha Phi Alpha also participates in the March...,Alpha Phi Alpha 还参加了 Dimes 'WalkAmerica 的 3 月活...,"[67, 95, 71, 65, 45, 50]",65.5,"[-0.14077147789135155, 0.891823783328571, 0.43...",-0.026856,-0.333988
3,"In 1995, Deftones released their debut album A...",1995 年 ， Deftones 发行了首张专辑《肾上腺素》。,"[83, 89, 92]",88.0,"[0.6605739498965285, 1.1664094557428424, 1.348...",1.05837,-0.32621
4,Kyrgios also supports the North Melbourne Kang...,基尔吉奥斯还在澳大利亚足球联盟中支持北墨尔本袋鼠足球俱乐部.,"[94, 81, 76]",83.666667,"[1.92791735961526, 0.5518925638710621, 0.45965...",0.979822,-0.613993


In [304]:
def mask(df, output_csv, source_lang, target_lang):
    """Mask named entities in df's `original` and `translation` columns, then write df to output_csv as TSV."""

    # Set up the source- and target-language pipelines (tokenization + NER)
    stanza.download(source_lang, processors='tokenize,ner', verbose=False)
    source_nlp = stanza.Pipeline(source_lang, processors='tokenize,ner', verbose=False)

    stanza.download(target_lang, processors='tokenize,ner', verbose=False)
    target_nlp = stanza.Pipeline(target_lang, processors='tokenize,ner', verbose=False)

    # Replace entities row by row (df is modified in place)
    for index, row in df.iterrows():
        print(index)  # progress indicator
        source, target = source_nlp(row.original), target_nlp(row.translation)

        # NOTE: mask_ents() does not appear in Stanza's documented Document API;
        # it is assumed here to come from a locally modified Stanza build.
        source.mask_ents()
        target.mask_ents()

        df.at[index, "original"] = source.text
        df.at[index, "translation"] = target.text

    # Export to TSV
    df.to_csv(path_or_buf=output_csv, sep='\t')
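The `mask_ents()` call above is not something I can confirm in stock Stanza, so here is a hedged sketch of how entity masking could be done with documented pieces only. Stanza exposes recognized entities as `doc.ents`, where each span carries `start_char`, `end_char`, and `type`. The helper `mask_spans` and its `<TYPE>` placeholder format are hypothetical names chosen for illustration, not the notebook's actual method:

```python
def mask_spans(text, spans, placeholder="<{type}>"):
    """Replace each (start_char, end_char, type) span in text with a placeholder.

    Hypothetical helper: spans are character offsets into text, in the style
    of Stanza's doc.ents (start_char, end_char, type).
    """
    pieces = []
    cursor = 0
    for start, end, ent_type in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already masked.
            continue
        pieces.append(text[cursor:start])
        pieces.append(placeholder.format(type=ent_type))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# With a Stanza document, the spans could be collected as:
#   spans = [(e.start_char, e.end_char, e.type) for e in doc.ents]
# Here they are written by hand so the example runs without any models.
sentence = "In 1995, Deftones released their debut album."
hand_spans = [(3, 7, "DATE"), (9, 17, "ORG")]
masked = mask_spans(sentence, hand_spans)
print(masked)
```

Working from character offsets rather than token positions keeps the untouched text, including its spacing and punctuation, exactly as it was, which matters for the Chinese side where tokens are not whitespace-delimited.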

### Now we should sort the sentences:
1) Attach the predicted score to the dataset<br>
2) Compute the absolute error (AE) between the predicted score and the actual score<br>
3) Sort by descending AE<br>

In [305]:
predicted_en_zh = pd.read_csv('input/english-chinese/en-zh/train.enzh.predicted.zmean.tsv', sep='\t', index_col='index')

In [306]:
predicted_en_zh.shape

(7000, 3)

In [307]:
#Let ae be the absolute difference between the actual and predicted z-mean scores.
ae = abs(en_zh.z_mean - predicted_en_zh.predictions)

In [308]:
#concatenate ae to the end of the dataset
en_zh = pd.concat([en_zh, ae], axis=1)

In [309]:
#rename new column to AE
en_zh.rename({0:"AE"}, axis=1, inplace=True)

In [310]:
#Sort by largest Absolute Error
en_zh.sort_values(by=["AE"], ascending=False, inplace=True)

In [311]:
en_zh.reset_index(drop=True, inplace=True)

In [312]:
#copy the slices so that masking `top` in place does not write into a view of en_zh
top = en_zh.iloc[:140].copy()
bottom = en_zh.iloc[140:].copy()
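The attach, rank, and split steps can be sketched in one chain on toy data. The frames `actual` and `predicted` and their values are made up for illustration; only the column names `z_mean` and `predictions` come from the notebook. Note that pandas aligns the subtraction on index labels, so the two frames do not need to be in the same row order:

```python
import pandas as pd

# Toy stand-ins for en_zh and predicted_en_zh (values are invented).
actual = pd.DataFrame({"z_mean": [0.5, -1.0, 2.0, 0.0]}, index=[0, 1, 2, 3])
predicted = pd.DataFrame({"predictions": [0.4, 1.0, 1.0, 0.0]}, index=[3, 2, 1, 0])

# Subtraction aligns on index labels, not on row position.
ranked = (
    actual.assign(AE=(actual["z_mean"] - predicted["predictions"]).abs())
    .sort_values("AE", ascending=False)
    .reset_index(drop=True)
)

# Split into the worst-predicted rows and the rest; copies are independent frames.
toy_top, toy_bottom = ranked.iloc[:2].copy(), ranked.iloc[2:].copy()
print(ranked)
```

Using `assign` names the new column directly, which avoids renaming the default `0` column that `pd.concat` produces for an unnamed Series.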

In [313]:
top

Unnamed: 0,original,translation,scores,mean,z_scores,z_mean,model_scores,AE
0,"On 9 May, squadrons operating the Falchi were ...","5 月 9 日 ， 操作法尔奇的中队是第 3 次 ""科科特路段"" ， 14 次 ， 第 4 ...","[14, 1, 8]",7.666667,"[-3.088933867982062, -4.068257635479569, -6.31...",-4.491143,-0.426555,5.086370
1,"Some bivalves, such as the scallops and file s...","有些双人, 如剪刀和文件壳, 可以游泳.","[1, 7, 13]",7.000000,"[-3.7953628771475936, -6.412099234218103, -2.5...",-4.253234,-0.941824,4.940496
2,"Egan, Charles E. ""U.S. Curbs Exports of Steel ...","Egan, Charles E., ""U. S. Curbs Exportof Steel"".","[20, 1, 1]",7.333333,"[-2.7628897099056626, -6.987272792413507, -4.3...",-4.695833,-0.839375,4.868545
3,During the recording of Audioslave's last albu...,"在录制最后一张音像专辑《启示》时, 莫雷洛尝试了不同的放大器插座.","[1, 1, 1]",1.000000,"[-3.9795755993568718, -3.361146712837946, -3.7...",-3.712028,-0.594917,4.626586
4,";Articles Jack Anderson, ""Preserving Nijinska'...","文章杰克 · 安德森 ， ""保存尼金斯卡的货架。","[9, 16, 1]",8.666667,"[-3.360637333045728, -2.7141746410194005, -5.2...",-3.782337,-0.744969,4.517940
...,...,...,...,...,...,...,...,...
135,"Bowman Gum, Inc. v. Topps Chewing Gum, Inc., 1...","Bowman Gum, Inc. 诉 Topps Cheing Gum, Inc., 103...","[5, 2, 8]",5.000000,"[-3.578000105096661, -3.2225236317227526, -2.6...",-3.159478,-0.313180,2.951103
136,"Sweets made out of tamarind, pineapple or guav...",用桃木、菠萝或瓜瓦以及脱水的花板制成的甜菜也很受欢迎。,"[27, 33, 21]",27.000000,"[-2.3825048588165303, -1.7309380561664423, -2....",-2.125821,-0.748876,2.950992
137,Kung Fu Hustle at LoveHKFilm.com The Six Degre...,Kung Fu Hustle at LoveHKFilm. com & Chow 和 Kun...,"[1, 12, 23]",12.000000,"[-3.3208841932087294, -3.197615254007528, -1.4...",-2.669530,-0.434657,2.948788
138,'s Helicarrier and internally maintained by be...,"由超子的良性化身在内部维持的 ""救生员"" 。","[31, 42, 19]",30.666667,"[-2.1651420867655977, -2.056532112890583, -2.5...",-2.257855,-1.008049,2.947119


## We now have AE for each row.
### Let's list what datasets we want to test:

1) Mask rows where AE > MAE (0.6444)<br>
2) Mask the top 350 rows by AE<br>
3) Mask the top 175 rows by AE<br>
4) Mask the top 140 rows by AE<br>
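The subsets listed above could be selected as in the following sketch. It runs on a made-up four-row frame, so the MAE it computes is that of the toy data, not the 0.6444 reported in the notebook; `above_mae` and `top_k` are illustrative names:

```python
import pandas as pd

# Toy frame with an AE column (values are invented).
ranked = pd.DataFrame({"AE": [2.0, 1.0, 0.5, 0.1]})

# Mean absolute error over all rows.
mae = ranked["AE"].mean()

# Variant 1: every row whose error exceeds the mean.
above_mae = ranked[ranked["AE"] > mae]

# Variants 2-4: a fixed number of worst rows (k would be 350, 175, or 140).
top_k = ranked.nlargest(3, "AE")

print(round(mae, 4), len(above_mae), len(top_k))
```

`nlargest` does not require the frame to be sorted beforehand, so it gives the same rows as slicing a frame already sorted by descending AE.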

In [None]:
mask(top, "output/default_en_zh/en_zh-masked-worst-2%-def.tsv", "en", "zh")

0
1
2
...
99
100


In [297]:
new_top = pd.read_csv("output/default_en_zh/en_zh-masked-worst-2%-def.tsv", sep='\t')

In [298]:
new_top.rename({"Unnamed: 0":"index"},axis=1, inplace=True)

In [299]:
new_top.set_index("index", inplace=True)

In [300]:
finished = pd.concat([new_top, bottom])

In [301]:
finished.to_csv("output/default_en_zh/en_zh-masked-worst-2%-def.tsv", sep='\t')