cord-19-queries.md

Vespa powered search of the CORD-19 dataset

Query API

By default, the frontend searches using the weakAnd query type.

Similar articles

The similar articles feature is implemented using approximate nearest neighbor search over the 768-dimensional SPECTER embeddings. Because the dataset contains many duplicates, we also remove articles that are too similar to each other, as measured by cosine similarity.

Deduping

The CORD-19 dataset contains many near duplicates. For every search request, we deduplicate the top 100 results using document-to-document similarity over the SPECTER embeddings and a similarity threshold. The dedup functionality is implemented in a Searcher.
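
The dedup step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Searcher: the function names, the greedy keep-first strategy, and the threshold value of 0.95 are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_hits(hits, threshold=0.95, top_k=100):
    """Greedy near-duplicate removal over the first top_k ranked hits.

    hits is a list of (doc_id, embedding) pairs in ranked order.
    A hit is dropped when it is at least `threshold` similar to a hit
    that was already kept. The threshold is a hypothetical value.
    Hits beyond top_k are passed through unchanged.
    """
    kept = []
    for doc_id, embedding in hits[:top_k]:
        if all(cosine_similarity(embedding, other) < threshold for _, other in kept):
            kept.append((doc_id, embedding))
    return kept + list(hits[top_k:])


# Toy 3-dimensional vectors stand in for the 768-dimensional embeddings.
ranked = [
    ("a", [1.0, 0.0, 0.0]),
    ("b", [0.99, 0.01, 0.0]),  # nearly identical to "a"
    ("c", [0.0, 1.0, 0.0]),
]
deduped_ids = [doc_id for doc_id, _ in dedup_hits(ranked)]
```

Keeping the first (highest ranked) member of each near-duplicate cluster preserves the original ranking order of the surviving hits.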

API Access

To use the Vespa Search API, see the API documentation and the YQL Query Language reference. For the full document definition, see doc.sd.

High level field description

These are the most important fields in the dataset:

| field | source in CORD-19 | indexed/searchable | summary (returned with hit) | available for grouping | matching | Vespa type |
|---|---|---|---|---|---|---|
| default | title + abstract | yes | no | no | tokenized and stemmed (match:text) | fieldset |
| title | title from metadata | yes | yes with bolding | no | tokenized and stemmed (match:text) | string |
| abstract | abstract from metadata | yes | yes with bolding and dynamic summary | no | tokenized and stemmed (match:text) | string |
| journal | journal | yes | yes | yes | exact matching | string |
| source | source | yes | yes | yes | exact matching | string |
| doi | https:// + doi from metadata | no | yes | no | no | string |
| id | row id from metadata.csv | yes | yes | yes | yes | int |
| authors | authors in metadata or authors from sha json if found | yes using sameElement() | yes | yes | yes | array of struct |
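
Fields marked as available for grouping can be used with Vespa's grouping language. The sketch below builds a request body that counts matching articles per journal. The helper name `journal_grouping_request` and the `max_groups` parameter are hypothetical, and the request has not been verified against the live endpoint.

```python
def journal_grouping_request(query, max_groups=5):
    """Build a request body that counts matching articles per journal."""
    yql = (
        "select * from sources * where userQuery() "
        f"| all(group(journal) max({max_groups}) each(output(count())));"
    )
    return {
        "yql": yql,
        "query": query,
        "type": "all",
        "hits": 0,  # regular hits are suppressed; only the grouping result is wanted
    }


grouping_request = journal_grouping_request("coronavirus temperature sensitivity")
```

The resulting dictionary can be posted as JSON to the search endpoint in the same way as any other request body.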

Ranking

See Vespa's Ranking documentation. There are two ranking profiles available:

| Ranking | Description |
|---|---|
| bm25 | Linear sum: bm25(title) + bm25(abstract) |
| colbert | Linear sum of colbert maxsim over title and abstract |

See Vespa BM25 and ColBERT.

The ranking profiles are defined in the document definition (doc.sd).
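
To make the colbert profile more concrete, here is a minimal sketch of the MaxSim operator as it is usually defined for ColBERT: for each query token embedding, take the maximum dot product against all document token embeddings, then sum over the query tokens. The function names and the toy vectors are assumptions for illustration; the actual ranking expression lives in doc.sd and may differ in detail.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_tokens, doc_tokens):
    """Sum over query tokens of the best-matching document token score."""
    return sum(
        max((dot(q, d) for d in doc_tokens), default=0.0)
        for q in query_tokens
    )


def colbert_score(query_tokens, title_tokens, abstract_tokens):
    """Linear sum of MaxSim over title and abstract."""
    return max_sim(query_tokens, title_tokens) + max_sim(query_tokens, abstract_tokens)


# Toy 2-dimensional token embeddings (made up for the example).
query = [[1.0, 0.0], [0.0, 1.0]]
title = [[1.0, 0.0]]
abstract = [[0.5, 0.5], [0.0, 1.0]]
score = colbert_score(query, title, abstract)
```

In this toy case the title contributes 1.0 and the abstract contributes 1.5, so `score` is 2.5.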

Example API queries

To use the Vespa Search API, see the API documentation and the YQL Query Language reference. The examples below use Python with the requests library and send queries to the search API using POST.

import requests

# Search for documents matching all query terms (in either title or abstract)
search_request_all = {
  'yql': 'select id, title, abstract, doi from sources * where userQuery();',
  'hits': 5,
  'summary': 'short',
  'query': 'coronavirus temperature sensitivity',
  'type': 'all',
  'ranking': 'bm25'
}

# Search for documents matching any of the query terms (in either title or abstract)
search_request_any = {
  'yql': 'select id, title, abstract, doi from sources * where userQuery();',
  'hits': 5,
  'summary': 'short',
  'query': 'coronavirus temperature sensitivity',
  'type': 'any',
  'ranking': 'colbert'
}

# Search for documents using weakAnd over the query terms (in either title or abstract)
search_request_weakand = {
  'yql': 'select id, title, abstract, doi from sources * where userQuery();',
  'hits': 5,
  'summary': 'short',
  'query': 'coronavirus temperature sensitivity',
  'type': 'weakAnd',
  'ranking': 'colbert'
}

# Search authors, which is an array of struct, using the sameElement operator
search_request_authors = {
  'yql': 'select id,authors from sources * where authors contains sameElement(first contains "Keith", last contains "Mansfield");',
  'hits': 5,
  'summary': 'short'
}

# Sample request
endpoint = 'https://api.cord19.vespa.ai/search/'
response = requests.post(endpoint, json=search_request_all)
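
Once a response comes back, the hits can be read from the JSON body. The sketch below assumes Vespa's default JSON result format, where hits are listed under `root.children` and each hit carries its returned fields under `fields`. The helper name `extract_hits` and the sample payload are made up for illustration, and the payload is not real output from the endpoint.

```python
def extract_hits(result):
    """Return (id, title, relevance) tuples from a Vespa JSON search result."""
    # "children" is absent when nothing matched, so default to an empty list.
    children = result.get("root", {}).get("children", [])
    rows = []
    for hit in children:
        fields = hit.get("fields", {})
        rows.append((fields.get("id"), fields.get("title"), hit.get("relevance")))
    return rows


# Made-up payload in the shape of the default result format.
sample_result = {
    "root": {
        "fields": {"totalCount": 2},
        "children": [
            {"relevance": 12.5, "fields": {"id": 1, "title": "First title"}},
            {"relevance": 9.0, "fields": {"id": 2, "title": "Second title"}},
        ],
    }
}
rows = extract_hits(sample_result)
```

With a live request, the same helper would be called as `extract_hits(response.json())`, ideally after `response.raise_for_status()` so that HTTP errors are not parsed as results.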