Demo of Sentence Transformers (https://www.sbert.net/index.html)

In [2]:
import pandas as pd
import numpy as np
import torch

In [3]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('allenai/scibert_scivocab_uncased')
model.max_seq_length = 512

In [1]:
# Two lists of sentences
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-distilroberta-base-v1')
sentences1 = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities between every sentence in list 1 and every sentence in list 2
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

# Output the pairs with their scores
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The cat sits outside 		 The dog plays in the garden 		 Score: 0.4579
A man is playing guitar 		 A woman watches TV 		 Score: 0.1759
The new movie is awesome 		 The new movie is so great 		 Score: 0.9283
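
Each score above is the cosine of the angle between two sentence embeddings: close to 1 for near-paraphrases, close to 0 for unrelated sentences. As a rough sketch of that computation, the following plain-Python version works on toy 3-dimensional vectors. The helper `cosine_similarity` and the vector values are illustrative assumptions, not part of sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (made-up values)
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]    # same direction as v1, so similarity is 1
v3 = [-3.0, 0.0, 1.0]   # orthogonal to v1, so similarity is 0

print(round(cosine_similarity(v1, v2), 4))
print(round(cosine_similarity(v1, v3), 4))
```

The score depends only on direction, not on vector length, which is why `v2` (twice `v1`) scores as identical to `v1`.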


In [5]:
import json

Loading function taken from: https://www.kaggle.com/foolofatook/zero-shot-classification-with-huggingface-pipeline

In [4]:
data_file = r'd:/arxivDS/arxiv-metadata-oai-snapshot.json'

def get_metadata():
    """Yield the snapshot one line (one JSON record) at a time.

    Reading lazily with `yield` avoids the memory problems of loading the whole JSON file at once.
    """
    with open(data_file, 'r', encoding='utf-8') as f:
        for line in f:
            yield line
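
To illustrate the lazy loading, here is a small self-contained sketch that writes two made-up records to a temporary file and reads them back one line at a time. The function name `iter_lines` and the sample records are assumptions for the example; the real snapshot has many more fields per record.

```python
import json
import os
import tempfile

def iter_lines(path):
    """Yield one raw line at a time, so only one record is held in memory."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            yield line

# Made-up records standing in for the arXiv snapshot (one JSON object per line)
records = [{"id": "a1", "title": "First"}, {"id": "a2", "title": "Second"}]
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False,
                                 encoding='utf-8') as tmp:
    for rec in records:
        tmp.write(json.dumps(rec) + "\n")
    tmp_path = tmp.name

lines = iter_lines(tmp_path)
first = json.loads(next(lines))                          # reads only the first line
remaining = [json.loads(line)["id"] for line in lines]   # consumes the rest
os.remove(tmp_path)                                      # file is closed once the generator is exhausted

print(first["title"], remaining)
```

Because a generator can be consumed only once, a second pass over the file needs a fresh call to the function.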

In [6]:
metadata = get_metadata()
ids = []
titles = []
abstracts = []
categories = []

for paper in metadata:
    metaDict = json.loads(paper)
    try:
        journal_ref = metaDict['journal-ref']
        try:
            year = int(journal_ref[-4:])      # Example format: "Phys.Rev.D76:013009,2007"
        except ValueError:
            year = int(journal_ref[-5:-1])    # Example format: "Phys.Rev.D76:013009,(2007)"
        if year == 2020:
            ids.append(metaDict['id'])
            titles.append(metaDict['title'])
            abstracts.append(metaDict['abstract'])
            categories.append(metaDict['categories'])
    except (KeyError, TypeError, ValueError):
        # Skip papers with no journal reference or no parsable year at its end
        continue
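
The slicing in the loop above assumes the year is the last four characters of `journal-ref`, optionally followed by a closing parenthesis. One possible alternative, sketched here with a hypothetical helper `extract_year`, is to look for a four-digit year at the end of the string with a regular expression. The pattern is an assumption about what counts as a year (1900 to 2099) and is not taken from the original notebook.

```python
import re

# A 4-digit year at the very end of the string, optionally wrapped in parentheses.
# The lookbehind rejects years that are merely the tail of a longer number.
_YEAR_AT_END = re.compile(r'(?<!\d)\(?((?:19|20)\d{2})\)?\s*$')

def extract_year(journal_ref):
    """Return the trailing year of a journal reference, or None if absent."""
    if not journal_ref:
        return None
    match = _YEAR_AT_END.search(journal_ref)
    return int(match.group(1)) if match else None

print(extract_year("Phys.Rev.D76:013009,2007"))
print(extract_year("Phys.Rev.D76:013009,(2007)"))
print(extract_year(None))
```

Returning `None` instead of raising makes the filter a plain `if extract_year(ref) == 2020:` with no exception handling.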

In [7]:
df = pd.DataFrame({'id' : ids,'title' : titles,'abstract' : abstracts, 'categories' : categories})


print(len(df))

26558


In [8]:
df.head()

   id        title                                              abstract                                         categories
0  712.1975  Reentrant spin glass transition in LuFe2O4         We have carried out a comprehensive investig...  cond-mat.str-el cond-mat.mtrl-sci
1  804.3104  Teichm\"uller Structures and Dual Geometric Gi...  The Gibbs measure theory for smooth potentia...  math.DS math.CV
2  810.5491  Nonequilibrium phase transition in a spreading...  We consider a nonequilibrium process on a ti...  cond-mat.stat-mech
3  902.3288  Origin and evolution of cosmic accelerators - ...  One of the most tantalizing questions in ast...  astro-ph.CO astro-ph.HE
4  908.2605  A use of geometric calculus to reduce Berezin ...  Berezin integration of functions of anticomm...  gr-qc


In [9]:
cat_list= df['categories'].unique()
print(*cat_list, sep = "\n")

cond-mat.str-el cond-mat.mtrl-sci
math.DS math.CV
cond-mat.stat-mech
astro-ph.CO astro-ph.HE
gr-qc
astro-ph.IM astro-ph.EP
cond-mat.str-el cond-mat.mes-hall
cond-mat.dis-nn cs.DM math.CO
math.DS
astro-ph.IM astro-ph.CO
hep-ph hep-lat hep-th
math-ph math.MP
physics.soc-ph cond-mat.dis-nn cs.SI
quant-ph
cond-mat.soft
physics.plasm-ph
nucl-ex
math.SG
math.CT
math.ST stat.TH
math.NT
cond-mat.mtrl-sci
nucl-th nucl-ex
cond-mat.stat-mech cond-mat.dis-nn
math.CO
math-ph cond-mat.mtrl-sci hep-th math.MP quant-ph
gr-qc hep-th
cond-mat.mes-hall
math-ph hep-th math.MP
math.QA math.CO math.CT
math.CA math-ph math.MP
math.OC math.PR
cs.IT math.AG math.IT
gr-qc hep-th quant-ph
hep-ph cond-mat.str-el hep-th
cs.IT math.IT math.PR quant-ph
cs.DS
nucl-th math-ph math.MP quant-ph
math.RA math.AC
cond-mat.supr-con cond-mat.str-el
cond-mat.quant-gas
math.FA math.CO
hep-ex nucl-ex
quant-ph physics.ed-ph
astro-ph.GA
math.AG math.DG math.RT
math.AP math.PR
math.DG math-ph math.AP math.MP
cond-mat.mtrl-sci cond

cond-mat.str-el cond-mat.mes-hall cond-mat.mtrl-sci cond-mat.other
physics.plasm-ph hep-th
physics.comp-ph physics.ed-ph
physics.soc-ph cond-mat.dis-nn nlin.AO q-bio.MN
quant-ph physics.class-ph
hep-ex astro-ph.HE
eess.IV cs.CV cs.LG physics.med-ph q-bio.QM
math.GT gr-qc math-ph math.MP math.QA
physics.gen-ph quant-ph
physics.atom-ph physics.chem-ph physics.comp-ph
nlin.AO physics.comp-ph
cs.CY cs.HC
hep-ph astro-ph.CO hep-ex hep-th
eess.SP physics.data-an physics.optics
cs.RO cs.CV math.OC
cond-mat.str-el cond-mat.mes-hall cond-mat.quant-gas hep-th
nlin.PS physics.plasm-ph
cs.GT cs.IR
physics.optics nlin.CD nlin.PS
physics.flu-dyn astro-ph.SR
physics.acc-ph cond-mat.other hep-th
nlin.PS cond-mat.soft
cs.LG cs.RO cs.SY eess.SY stat.ML
physics.soc-ph cond-mat.stat-mech cs.CY cs.SI
gr-qc physics.class-ph
nucl-th physics.optics
physics.optics nlin.PS physics.comp-ph
physics.chem-ph physics.atom-ph quant-ph
quant-ph math.NT
math.CA math.DS
cs.CL cs.AI cs.IR cs.LG
eess.SP cs.GT cs.SY eess.S

In [10]:

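# regex=False matches the literal string "cs.LG"; by default the pattern is a regex, where "." matches any character
ml_df = df[df['categories'].str.contains("cs.LG", regex=False)]

sentencesList = ml_df['abstract'].tolist()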
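
The `categories` column holds space-separated arXiv category codes, so another way to select papers is to split the string and test for an exact code. The sketch below uses a hypothetical helper `has_category` on a few sample strings; the first three appear in the category listing above and the last one is made up to show what an unescaped regex would also accept.

```python
def has_category(categories, wanted):
    """True if `wanted` is one of the space-separated category codes."""
    return wanted in categories.split()

rows = [
    "cs.CL cs.AI cs.IR cs.LG",
    "eess.IV cs.CV cs.LG physics.med-ph q-bio.QM",
    "math.DS math.CV",
    "csXLG",   # hypothetical string that the regex "cs.LG" would match
]

selected = [r for r in rows if has_category(r, "cs.LG")]
print(selected)
```

With pandas, the same helper could presumably be applied as `df[df['categories'].apply(has_category, wanted="cs.LG")]`.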

In [11]:
print(len(ml_df))


1077


In [12]:
print(sentencesList[0])

  The extraction and understanding of temporal events and their relations are
major challenges in natural language processing. Processing text on a
sentence-by-sentence or expression-by-expression basis often fails, in part due
to the challenge of capturing the global consistency of the text. We present an
ensemble method, which reconciles the outputs of multiple classifiers of
temporal expressions across the text using integer programming. Computational
experiments show that the ensemble improves upon the best individual results
from two recent challenges, SemEval-2013 TempEval-3 (Temporal Annotation) and
SemEval-2016 Task 12 (Clinical TempEval).



In [13]:
ml_df.iloc[1]['id']

'1611.10351'

In [14]:
import time
start_time = time.time()
embeddings = model.encode(sentencesList, convert_to_tensor=True)
end_time = time.time()
print("Time for computing embeddings:"+ str(end_time-start_time) )


Time for computing embeddings:283.8583347797394


In [15]:
print(np.shape(embeddings))

torch.Size([1077, 768])


In [20]:
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

In [21]:
# For each paper, print it together with the most similar other paper
for count, scores in enumerate(cosine_scores):
    scores[count] = 0.0    # zero out the self-similarity so a paper cannot match itself
    max_index = torch.argmax(scores).item()
    paper = ml_df.iloc[count]
    match = ml_df.iloc[max_index]
    print("\n*********\n")
    print("**Paper Id :" + paper['id'] + ' ' + '\nTitle :' + paper['title']
          + '\n' + paper['abstract'] +
          '\n**Paper Id :' + match['id'] + ' ' + '\nTitle :' +
          match['title'] + '\n' + match['abstract'])



*********

**Paper Id :1412.1866 
Title :Integer-Programming Ensemble of Temporal-Relations Classifiers
  The extraction and understanding of temporal events and their relations are
major challenges in natural language processing. Processing text on a
sentence-by-sentence or expression-by-expression basis often fails, in part due
to the challenge of capturing the global consistency of the text. We present an
ensemble method, which reconciles the outputs of multiple classifiers of
temporal expressions across the text using integer programming. Computational
experiments show that the ensemble improves upon the best individual results
from two recent challenges, SemEval-2013 TempEval-3 (Temporal Annotation) and
SemEval-2016 Task 12 (Clinical TempEval).

**Paper Id :2004.01546 
Title :Temporarily-Aware Context Modelling using Generative Adversarial
  Networks for Speech Activity Detection
  This paper presents a novel framework for Speech Activity Detection (SAD).
Inspired by the recent s

**Paper Id :1911.01004 
Title :Why Non-myopic Bayesian Optimization is Promising and How Far Should We
  Look-ahead? A Study via Rollout
  Lookahead, also known as non-myopic, Bayesian optimization (BO) aims to find
optimal sampling policies through solving a dynamic programming (DP)
formulation that maximizes a long-term reward over a rolling horizon. Though
promising, lookahead BO faces the risk of error propagation through its
increased dependence on a possibly mis-specified model. In this work we focus
on the rollout approximation for solving the intractable DP. We first prove the
improving nature of rollout in tackling lookahead BO and provide a sufficient
condition for the used heuristic to be rollout improving. We then provide both
a theoretical and practical guideline to decide on the rolling horizon
stagewise. This guideline is built on quantifying the negative effect of a
mis-specified model. To illustrate our idea, we provide case studies on both
single and multi-information


**Paper Id :2004.01899 
Title :A Generic Graph-based Neural Architecture Encoding Scheme for
  Predictor-based NAS
  This work proposes a novel Graph-based neural ArchiTecture Encoding Scheme,
a.k.a. GATES, to improve the predictor-based neural architecture search.
Specifically, different from existing graph-based schemes, GATES models the
operations as the transformation of the propagating information, which mimics
the actual data processing of neural architecture. GATES is a more reasonable
modeling of the neural architectures, and can encode architectures from both
the "operation on node" and "operation on edge" cell search spaces
consistently. Experimental results on various search spaces confirm GATES's
effectiveness in improving the performance predictor. Furthermore, equipped
with the improved performance predictor, the sample efficiency of the
predictor-based neural architecture search (NAS) flow is boosted. Codes are
available at https://github.com/walkerning/aw_nas.

**Paper


*********

**Paper Id :2009.08720 
Title :Contextual Semantic Interpretability
  Convolutional neural networks (CNN) are known to learn an image
representation that captures concepts relevant to the task, but do so in an
implicit way that hampers model interpretability. However, one could argue that
such a representation is hidden in the neurons and can be made explicit by
teaching the model to recognize semantically interpretable attributes that are
present in the scene. We call such an intermediate layer a \emph{semantic
bottleneck}. Once the attributes are learned, they can be re-combined to reach
the final decision and provide both an accurate prediction and an explicit
reasoning behind the CNN decision. In this paper, we look into semantic
bottlenecks that capture context: we want attributes to be in groups of a few
meaningful elements and participate jointly to the final decision. We use a
two-layer semantic bottleneck that gathers attributes into interpretable,
sparse groups, a

In [17]:
print(len(embeddings))

1077


In [18]:
print(np.shape(embeddings))

torch.Size([1077, 768])


In [19]:
print(np.shape(cosine_scores))

torch.Size([1077, 1077])
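
The 1077 x 1077 matrix holds the similarity of every abstract to every other abstract, and cell In [21] picks, for each row, the column with the highest score after zeroing the diagonal. A minimal plain-Python sketch of that nearest-neighbour step on a made-up 3 x 3 matrix is shown below. Unlike the notebook cell, the hypothetical helper `nearest_neighbours` skips the diagonal instead of overwriting it, so the matrix is left unchanged and rows with only negative similarities are still handled.

```python
def nearest_neighbours(sim):
    """For each row i of a square similarity matrix, return the index j != i
    with the highest score (the first such index if there is a tie)."""
    result = []
    for i, row in enumerate(sim):
        others = (j for j in range(len(row)) if j != i)
        result.append(max(others, key=lambda j: row[j]))
    return result

# Made-up symmetric similarity matrix for three documents
sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]

print(nearest_neighbours(sim))
```

Note that the relation is not symmetric: document 1 is closest to document 2, but document 2 is closest to document 0.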
