
# Semantic Search in Publications

This notebook demonstrates how [sentence-transformers](https://www.sbert.net) and the [SPECTER](https://github.com/allenai/specter) model can be used to find similar publications.

The corpus consists of all EMNLP publications from 2016 to 2018.

We then search this corpus for papers similar to those presented at EMNLP 2019 and 2020.


In [None]:
!pip install sentence-transformers
!pip install -U -q PyDrive
!pip install google-cloud-vision


In [None]:
import io
import json
import os

from google.cloud import vision
from google.colab import auth, drive
from oauth2client.client import GoogleCredentials
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from sentence_transformers import SentenceTransformer, util

# Point the Google Cloud client libraries at the service-account key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'seismic-diorama-316110-5569927e0d86.json'
# Client for the Cloud Vision API, used later for text detection
client = vision.ImageAnnotatorClient()


In [None]:
# First, load the papers dataset (title and abstract for each paper)
dataset_file = 'emnlp2016-2018.json'

# Download the dataset once; later runs reuse the local copy
if not os.path.exists(dataset_file):
  util.http_get("https://sbert.net/datasets/emnlp2016-2018.json", dataset_file)

with open(dataset_file, encoding='utf-8') as fIn:
  papers = json.load(fIn)

print(len(papers), "papers loaded")
 



974 papers loaded
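
The later cells read the `title`, `abstract`, and `url` fields of each record, so it can help to check for them right after loading. The following is a minimal sketch, assuming each paper is a flat dict with those three keys. `find_incomplete` is a hypothetical helper, and the sample records are made up for illustration:

```python
REQUIRED_KEYS = ("title", "abstract", "url")  # fields the later cells read

def find_incomplete(records, required=REQUIRED_KEYS):
    """Return (index, missing_keys) for records lacking a required, non-empty field."""
    problems = []
    for i, record in enumerate(records):
        missing = [key for key in required if not record.get(key)]
        if missing:
            problems.append((i, missing))
    return problems

# Made-up records for illustration; in the notebook you would pass `papers`.
sample = [
    {"title": "A", "abstract": "Some text.", "url": "http://example.org/a"},
    {"title": "B", "abstract": "", "url": "http://example.org/b"},
]
print(find_incomplete(sample))
```

Records flagged this way could be dropped or given a placeholder abstract before encoding; which is preferable depends on how many there are.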


In [None]:
# Load the allenai-specter model with SentenceTransformers
model = SentenceTransformer('allenai-specter')

# To encode the papers, combine each title and abstract into a single string
paper_texts = [paper['title'] + '[SEP]' + paper['abstract'] for paper in papers]

# Compute embeddings for all papers
corpus_embeddings = model.encode(paper_texts, convert_to_tensor=True)
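
Once the papers are embedded, finding similar ones comes down to comparing vectors. `util.semantic_search` in sentence-transformers scores by cosine similarity by default and returns the top 10 hits per query. The sketch below illustrates that idea in plain Python on toy 3-dimensional vectors; it is not the library's implementation, and `cosine_similarity` and `top_k` are hypothetical names:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=10):
    """Return the k best matches as dicts with 'corpus_id' and 'score'."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query, vec)}
        for i, vec in enumerate(corpus)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:k]

# Toy vectors standing in for real paper embeddings.
toy_corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [1.0, 0.1, 0.0]
hits = top_k(toy_query, toy_corpus, k=2)
print([hit["corpus_id"] for hit in hits])  # prints [0, 2]
```

The hit dicts use the same `corpus_id` and `score` keys that `util.semantic_search` returns, so the result can be indexed back into `papers` in the same way.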






In [None]:
# Given a query text (a title with no abstract), search the corpus for similar papers
from termcolor import colored

def search_papers(title):
  # Mirror the corpus format: title, then [SEP], then an (empty) abstract
  query_embedding = model.encode(title + '[SEP]', convert_to_tensor=True)

  search_hits = util.semantic_search(query_embedding, corpus_embeddings)
  search_hits = search_hits[0]  # Get the hits for the first (and only) query

  print("Query:", title)
  print("\nMost similar papers:")
  for rank, hit in enumerate(search_hits, start=1):
    related_paper = papers[hit['corpus_id']]
    print()
    print(str(rank) + ". " + colored(related_paper['title'], 'red'))
    print(related_paper['abstract'])
    print(related_paper['url'])
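
The corpus texts were built as title, `[SEP]`, abstract, so it seems reasonable to format a query the same way. The following is a minimal sketch, assuming the query may come from OCR and contain stray line breaks. `build_query_text` is a hypothetical helper, not part of sentence-transformers:

```python
def build_query_text(title, abstract="", sep="[SEP]"):
    """Join title and abstract the same way the corpus texts were built."""
    # Collapse line breaks and repeated spaces, which OCR output often contains.
    clean_title = " ".join(title.split())
    clean_abstract = " ".join(abstract.split())
    return clean_title + sep + clean_abstract

ocr_like = "is a\nNatural language\nresearch project,\nkey component of my\n"
print(build_query_text(ocr_like))
# prints: is a Natural language research project, key component of my[SEP]
```

Whether whitespace cleanup changes the ranking in practice is untested here; the tokenizer already treats most whitespace alike, so the main benefit is a more readable printed query.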



In [None]:

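# Read the image and extract its text with Cloud Vision document text detection
with open("nlp.jpg", 'rb') as image_file:
    content = image_file.read()

image = vision.Image(content=content)
response = client.document_text_detection(image=image)

# Use the recognized text as the search query
docText = response.full_text_annotation.text

search_papers(docText)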

Query: is a
Natural language
research project,
key component of my


Most similar papers:

1. Natural Language Processing with Small Feed-Forward Networks
Over the past few years, neural networks have re-emerged as powerful machine-learning models, yielding state-of-the-art results in fields such as image recognition and speech processing. More recently, neural network models started to be applied also to textual natural language signals, again with very promising results. This tutorial surveys neural network models from the perspective of natural language processing research, in an attempt to bring natural-language researchers up to speed with the neural techniques. The tutorial covers input encoding for natural language tasks, feed-forward networks, convolutional networks, recurrent networks and recursive networks, as well as the computation graph abstraction for automatic gradient computation.
http://aclweb.org/anthology/D17-1309

2. Learning Translations via Matrix Co