In [1]:
import pandas as pd
from scipy import stats

# STSBenchmark

Evaluated on the dev split of STSBenchmark: https://ixa2.si.ehu.es/stswiki

In [2]:
col_names = ['filename', 'genre', 'year', 'id', 'score', 'sentence1', 'sentence2']
# quoting=3 is csv.QUOTE_NONE: quote characters inside sentences are kept as plain text.
target = pd.read_csv('data/sts-dev.csv', sep='\t', quoting=3, header=None, names=col_names)['score']
bm25 = pd.read_csv('data/sts_elsearch_dev.csv')['score']
bert_sent = pd.read_csv('data/sts_bert_dev.csv')['score']
universal = pd.read_csv('data/sts_universal_dev.csv')['score']

#### Pearson correlation coefficient

In [3]:
print('BM25:', '% .2f' % (stats.pearsonr(target, bm25)[0] * 100))
print('BERT:', '% .2f' % (stats.pearsonr(target, bert_sent)[0] * 100))
print('USE:', '% .2f' % (stats.pearsonr(target, universal)[0] * 100))

BM25:  68.48
BERT:  87.96
USE:  83.64
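
As a sanity check on what the numbers above measure, the following sketch recomputes Pearson's r by hand and compares it with `stats.pearsonr`. The two score vectors and the helper name `pearson_manual` are made up for illustration; they are not taken from the STS files.

```python
import numpy as np
from scipy import stats

# Hypothetical toy scores, NOT taken from data/sts-dev.csv.
gold = np.array([0.0, 1.5, 2.0, 3.5, 5.0])
pred = np.array([0.1, 0.4, 0.5, 0.7, 0.9])

def pearson_manual(x, y):
    # Sum of products of the centered vectors, divided by the product of their norms.
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

r_manual = pearson_manual(gold, pred)
r_scipy = stats.pearsonr(gold, pred)[0]  # index 0 is r, index 1 is the p-value
print('manual: %.4f, scipy: %.4f' % (r_manual, r_scipy))
```

The cells above multiply r by 100, so a printed value of 87.96 corresponds to r = 0.8796.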


# Idioms

IBM Debater - Sentiment Lexicon of IDiomatic Expressions (SLIDE) (https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml) (~4500 idioms) + definitions from wiktionary.org

In [4]:
idioms = pd.read_csv('data/idioms.csv')

### BM25

In [5]:
elsearch = pd.read_csv('data/elsearch_idioms.csv')
print('Columns:', list(elsearch.columns))
elsearch.drop_duplicates(subset=['article'])[['article', 'candidate', 'score']]

Columns: ['article', 'content', 'candidate', 'definition', 'score', 'type']


Unnamed: 0,article,candidate,score
0,Someone tried to sell a baby on eBay for $5K,here we go again,26.253231
5,Florist busted for stealing flowers off graves,easier said than done,20.275879
10,Why Nintendo’s new console tastes awful,keep one's mouth shut,29.959396
15,Murder plot exposed after man sent ‘kill my wi...,what you see is what you get,17.713026
20,Man who claims to be 146 years old wants to die,have the time of one's life,18.638319
25,Trump: I Have More Foreign Policy Experience T...,take matters into one's own hands,50.313915
30,Trump releases bromance letter from Putin,get one's money's worth,14.363069
35,FBI releases nearly 200 pages of Clinton email...,what do you say,27.248457
40,Presidential debate expected to be the most-wa...,have the time of one's life,13.970611
45,The Ancient Military Barracks Hidden Under Rom...,get one's money's worth,15.732018


### BERT

In [6]:
bert = pd.read_csv('data/bert_idioms.csv')
print('Columns:', list(bert.columns))
bert.drop_duplicates(subset=['article'])[['article', 'candidate', 'distance']]

Columns: ['article', 'content', 'candidate', 'definition', 'distance', 'type']


Unnamed: 0,article,candidate,distance
0,Someone tried to sell a baby on eBay for $5K,give someone a hard time,0.7833
5,Florist busted for stealing flowers off graves,keep up with the Joneses,0.8028
10,Why Nintendo’s new console tastes awful,give someone a hard time,0.6761
15,Murder plot exposed after man sent ‘kill my wi...,keep up with the Joneses,0.6818
20,Man who claims to be 146 years old wants to die,put on one's dancing shoes,0.7576
25,Trump: I Have More Foreign Policy Experience T...,best thing since sliced bread,0.7379
30,Trump releases bromance letter from Putin,put on one's dancing shoes,0.6526
35,FBI releases nearly 200 pages of Clinton email...,and be done with it,0.6773
40,Presidential debate expected to be the most-wa...,have the time of one's life,0.7891
45,The Ancient Military Barracks Hidden Under Rom...,strike while the iron is hot,0.6791


### USE

In [7]:
use = pd.read_csv('data/universal_idioms.csv')
print('Columns:', list(use.columns))
use.drop_duplicates(subset=['article'])[['article', 'candidate', 'distance']]

Columns: ['article', 'content', 'candidate', 'definition', 'distance', 'type']


Unnamed: 0,article,candidate,distance
0,Someone tried to sell a baby on eBay for $5K,pot calling the kettle black,0.8575
5,Florist busted for stealing flowers off graves,take the law into one's own hands,0.8499
10,Why Nintendo’s new console tastes awful,give someone a hard time,0.9011
15,Murder plot exposed after man sent ‘kill my wi...,have blood on one's hands,0.8403
20,Man who claims to be 146 years old wants to die,take someone's word for it,0.8398
25,Trump: I Have More Foreign Policy Experience T...,get one's foot in the door,0.8375
30,Trump releases bromance letter from Putin,get on the end of,0.8284
35,FBI releases nearly 200 pages of Clinton email...,genie is out of the bottle,0.8478
40,Presidential debate expected to be the most-wa...,come out of the woodwork,0.8801
45,The Ancient Military Barracks Hidden Under Rom...,let the cat out of the bag,0.8702


In [8]:
use.iloc[15]['article']

'Murder plot exposed after man sent ‘kill my wife’ texts to wrong\xa0contact'

#### Intersection

In [9]:
print('BM25 & BERT:', len(pd.merge(elsearch, bert, how='inner', on=['article', 'candidate'])) / 50)
print('BM25 & USE:', len(pd.merge(elsearch, use, how='inner', on=['article', 'candidate'])) / 50)
print('BERT & USE:', len(pd.merge(bert, use, how='inner', on=['article', 'candidate'])) / 50)

BM25 & BERT: 0.08
BM25 & USE: 0.1
BERT & USE: 0.32
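
The cell above divides by a hard-coded 50, which presumably corresponds to 10 articles with 5 candidates each (the row indices in the tables step by 5). A minimal sketch of the same overlap measure without the constant is given below, assuming each frame has `article` and `candidate` columns. The helper name `candidate_overlap` and the toy frames are hypothetical.

```python
import pandas as pd

def candidate_overlap(left, right):
    """Fraction of (article, candidate) pairs in `left` that also occur in `right`."""
    keys = ['article', 'candidate']
    # Deduplicate first so that repeated rows cannot inflate the merge result.
    left_pairs = left[keys].drop_duplicates()
    right_pairs = right[keys].drop_duplicates()
    shared = pd.merge(left_pairs, right_pairs, how='inner', on=keys)
    return len(shared) / len(left_pairs)

# Hypothetical toy frames, NOT the real retrieval results.
a = pd.DataFrame({'article': ['t1', 't1', 't2', 't2'],
                  'candidate': ['x', 'y', 'x', 'z']})
b = pd.DataFrame({'article': ['t1', 't1', 't2', 't2'],
                  'candidate': ['y', 'w', 'z', 'x']})
print(candidate_overlap(a, b))
```

Normalizing by the size of the left frame matches the division by 50 only if every results file holds exactly 50 distinct pairs, which is an assumption about the data.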


# Results

In [10]:
def search_definition(idiom):
    # Return the definition of the first row in `idioms` whose 'idiom' column equals `idiom`.
    return idioms[idioms['idiom'] == idiom].reset_index()['definition'][0]
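
The helper above raises an exception when the idiom is not present in the table. A possible tolerant variant is sketched below. The name `lookup_definition`, the `default` parameter, and the toy frame are hypothetical; only the column names `idiom` and `definition` come from the helper above.

```python
import pandas as pd

# Hypothetical stand-in for data/idioms.csv with placeholder rows.
toy_idioms = pd.DataFrame({
    'idiom': ['idiom A', 'idiom B'],
    'definition': ['definition A', 'definition B'],
})

def lookup_definition(frame, idiom, default=None):
    # Return the first matching definition, or `default` if the idiom is absent.
    matches = frame.loc[frame['idiom'] == idiom, 'definition']
    return matches.iloc[0] if len(matches) else default

print(lookup_definition(toy_idioms, 'idiom A'))
print(lookup_definition(toy_idioms, 'missing idiom', default='<not found>'))
```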

Sentence-BERT achieves the best result on STSBenchmark (unsurprisingly, since it was trained for exactly this task) and returns candidates that are "closer" than those of the Universal Sentence Encoder. On the idiom results, however, USE does better in terms of meaning, since it works better with paragraphs.

For example, consider the article titled `"FBI releases nearly 200 pages of Clinton email probe documents"` (the title alone gives a rough idea of what it is about).

BERT returns the idiom `"and be done with it"`: '(idiomatic) Used to terminate discussion or delay with a call to action'

USE, in contrast, returns `"genie is out of the bottle"`: '(idiomatic) Information has been released that will have ongoing consequences. (idiomatic) Something has been brought into reality that cannot be eliminated or undone.'

Or take the article `"Murder plot exposed after man sent ‘kill my wife’ texts to wrong contact"`.

BERT - `"keep up with the Joneses"`: '(idiomatic) To act or make purchases for status or image rather than out of need, especially for the purpose of competing with friends, neighbors, or society.'

USE - `"have blood on one's hands"`: '(idiomatic) To be responsible for a violent act.'


In the remaining cases, BERT and USE either return roughly equally relevant idioms or both get it wrong.

### BERT for lead

In [14]:
bert_lead = pd.read_csv('data/bert_idioms_lead.csv')
print('Columns:', list(bert_lead.columns))
bert_lead.drop_duplicates(subset=['article'])[['article', 'candidate', 'distance']]

Columns: ['article', 'content', 'candidate', 'definition', 'distance', 'type']


Unnamed: 0,article,candidate,distance
0,Someone tried to sell a baby on eBay for $5K,keep up with the Joneses,0.7882
5,Florist busted for stealing flowers off graves,keep up with the Joneses,0.7689
10,Why Nintendo’s new console tastes awful,take a dim view of,0.6406
15,Murder plot exposed after man sent ‘kill my wi...,keep up with the Joneses,0.6565
20,Man who claims to be 146 years old wants to die,have one's cake and eat it too,0.7135
25,Trump: I Have More Foreign Policy Experience T...,best thing since sliced bread,0.7563
30,Trump releases bromance letter from Putin,find it in one's heart,0.6958
35,FBI releases nearly 200 pages of Clinton email...,come out of the closet,0.7156
40,Presidential debate expected to be the most-wa...,come in from the cold,0.8211
45,The Ancient Military Barracks Hidden Under Rom...,strike while the iron is hot,0.5907


### USE for lead

In [13]:
use_lead = pd.read_csv('data/universal_idioms_lead.csv')
print('Columns:', list(use_lead.columns))
use_lead.drop_duplicates(subset=['article'])[['article', 'candidate', 'distance']]

Columns: ['article', 'content', 'candidate', 'definition', 'distance', 'type']


Unnamed: 0,article,candidate,distance
0,Someone tried to sell a baby on eBay for $5K,take the law into one's own hands,0.8829
5,Florist busted for stealing flowers off graves,take the law into one's own hands,0.8247
10,Why Nintendo’s new console tastes awful,strike while the iron is hot,0.8922
15,Murder plot exposed after man sent ‘kill my wi...,have blood on one's hands,0.8775
20,Man who claims to be 146 years old wants to die,put on one's dancing shoes,0.8958
25,Trump: I Have More Foreign Policy Experience T...,get one's foot in the door,0.864
30,Trump releases bromance letter from Putin,get on the end of,0.83
35,FBI releases nearly 200 pages of Clinton email...,genie is out of the bottle,0.8246
40,Presidential debate expected to be the most-wa...,it takes two to tango,0.873
45,The Ancient Military Barracks Hidden Under Rom...,come in from the cold,0.9071


In [20]:
print('BERT & BERT_lead:', len(pd.merge(bert_lead, bert, how='inner', on=['article', 'candidate'])) / 50)
print('USE & USE_lead:', len(pd.merge(use_lead, use, how='inner', on=['article', 'candidate'])) / 50)
print('BERT_lead & USE_lead:', len(pd.merge(use_lead, bert_lead, how='inner', on=['article', 'candidate'])) / 50)

BERT & BERT_lead: 0.68
USE & USE_lead: 0.54
BERT_lead & USE_lead: 0.2


Note that both methods now return the same idiom for different articles more often (presumably because of the shorter context). Even so, USE still works better in some cases. For example, for `"FBI releases nearly 200 pages of Clinton email probe documents"` BERT returns `"come out of the closet"`: "(intransitive, idiomatic) To tell others about one's homosexuality, bisexuality, transness, or any minority or disapproved-of belief, preference, etc., where previously this had been kept secret." This latches onto the notion of a "secret", even though the idiom's main meaning does not fit at all.
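
The observation that the same idiom comes back for different articles can be checked by counting how often each top-ranked candidate repeats. The sketch below assumes rows are ordered best-first within each article, as the `drop_duplicates(subset=['article'])` calls above suggest. The helper name `top_candidate_repeats` and the toy frame are hypothetical.

```python
import pandas as pd

def top_candidate_repeats(frame):
    """Count how many articles share each top-ranked candidate."""
    # Keep only the first (assumed best) row per article.
    top = frame.drop_duplicates(subset=['article'])
    return top['candidate'].value_counts()

# Hypothetical toy frame, NOT the real retrieval results.
toy = pd.DataFrame({
    'article':   ['t1', 't1', 't2', 't2', 't3', 't3'],
    'candidate': ['x',  'y',  'x',  'z',  'w',  'x'],
})
counts = top_candidate_repeats(toy)
print(counts)
print('distinct top candidates:', counts.size)
```

Applied to `bert` versus `bert_lead` (or `use` versus `use_lead`), a smaller number of distinct top candidates for the lead variant would support the observation.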