## Consistency check of synthetic queries

This notebook takes the synthetic queries generated by generate_synthetic_data_using_t5.ipynb, checks each
one for consistency against the search index, and writes the queries that pass out as training data.

In [39]:
!pip3 install --upgrade pandas requests transformers pyarrow

Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.25.1
    Uninstalling transformers-4.25.1:
      Successfully uninstalled transformers-4.25.1
Successfully installed transformers-4.26.0


In [1]:
import pandas as pd
import torch

In [40]:
queries = pd.read_csv('trec-covid-queries.tsv', delimiter='\t', names=['id','query'])

In [41]:
queries

Unnamed: 0,id,query
0,5p68npdb,query: what is the name of the virus that caus...
1,pd1g119c,query: what is the title of the abstract that ...
2,y5gmlsi1,query: what tyrosine kinases are involved in d...
3,7o1hprbe,query: what is the research activity on probio...
4,rbrjlz25,query: What is the effect of climate change on...
...,...,...
33095,kcfhnvqg,query: what is the best surgical treatment for...
33096,v4wre1qk,query: what is the agent based modeling of the...
33097,2v4izbiq,query: what is the name of the virus that caus...
33098,y6jnbp81,query: what is the vascular integrity of the a...


In [42]:
queries['clean_query'] = queries['query'].apply(lambda x: x.lower().replace('query:','').strip())

In [43]:
queries

Unnamed: 0,id,query,clean_query
0,5p68npdb,query: what is the name of the virus that caus...,what is the name of the virus that caused resp...
1,pd1g119c,query: what is the title of the abstract that ...,what is the title of the abstract that is titl...
2,y5gmlsi1,query: what tyrosine kinases are involved in d...,what tyrosine kinases are involved in dengue v...
3,7o1hprbe,query: what is the research activity on probio...,what is the research activity on probiotics in...
4,rbrjlz25,query: What is the effect of climate change on...,what is the effect of climate change on human ...
...,...,...,...
33095,kcfhnvqg,query: what is the best surgical treatment for...,what is the best surgical treatment for adenoc...
33096,v4wre1qk,query: what is the agent based modeling of the...,what is the agent based modeling of the spread...
33097,2v4izbiq,query: what is the name of the virus that caus...,what is the name of the virus that caused the ...
33098,y6jnbp81,query: what is the vascular integrity of the a...,what is the vascular integrity of the arterial...


In [6]:
import numpy as np

The following cell queries the index with each generated synthetic query. If the query does not retrieve
the document it was generated from at position 0, the query is discarded.

If the query does retrieve that document at position 0, we sample 2 other documents from the top 100 hits
as negatives.
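
The filtering rule described above can be sketched on plain Python lists, without a running search endpoint. This is a minimal illustration, not the notebook's implementation: `consistency_filter`, `ranked_ids`, `num_negatives`, and `seed` are hypothetical names, and the sketch samples negatives without replacement using the standard library, whereas the cell below uses `np.random.choice`.

```python
import random


def consistency_filter(doc_id, ranked_ids, num_negatives=2, seed=None):
    """Return (positive_id, negative_ids) if doc_id is ranked first, else None.

    ranked_ids is a hypothetical list of document ids ordered by rank,
    standing in for the hits returned by the search endpoint.
    """
    # The query is only kept if its source document is the top hit.
    if not ranked_ids or ranked_ids[0] != doc_id:
        return None
    # Every other hit in the ranked list is a candidate negative.
    candidates = ranked_ids[1:]
    rng = random.Random(seed)
    # Sample distinct negatives, or fewer if the list is short.
    k = min(num_negatives, len(candidates))
    negatives = rng.sample(candidates, k)
    return doc_id, negatives


# Source document "a" is ranked first, so the query is kept.
kept = consistency_filter("a", ["a", "b", "c", "d"], seed=0)
# Source document "a" is ranked second, so the query is dropped.
dropped = consistency_filter("a", ["b", "a", "c"])
```

Here `kept` holds the positive id together with two distinct ids drawn from the remaining hits, and `dropped` is `None`.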

In [31]:
def search(row):
    query = row['clean_query']
    doc_id = row['id']
    query_request = {
        'yql': 'select title, abstract, matchfeatures, cord_uid from doc where {"grammar":"tokenize", "targetHits":200}userInput(@query)',
        'query': query, 
        'ranking': 'hybrid-colbert',
        'bolding': 'false',
        'hits' : 100,  
        'language' : 'en', 
        'timeout' : '20s',
        'summary': 'short'
    }
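    try:
        response = session.post("http://localhost:8080/search/", json=query_request, timeout=120)
    except requests.exceptions.RequestException:
        # Retry once on a transport-level failure such as a timeout or dropped connection
        response = session.post("http://localhost:8080/search/", json=query_request, timeout=120)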
    if response.ok:
        json_result = response.json()
        root = json_result['root']
        total_count = root['fields']['totalCount']
        
        positive_pairs = []
        negative_pairs = []
        
        if total_count > 0:
          pos = 0
          for hit in root['children']:
            id = hit['fields'].get('cord_uid')
            if id is None:
              continue
            relevant = False
            if id == doc_id and pos < 1:
              relevant = True
            title = hit['fields'].get('title')
            abstract = hit['fields'].get('abstract')
            relevance = hit['relevance']
            bm25 = hit['fields']['matchfeatures']['bm25']
            colbert = hit['fields']['matchfeatures']['colbert_maxsim']
            doc = {
              "query": query,
              "doc_id": id,
              "relevant": relevant,
              "title": title,
              "abstract": abstract,
              "score": relevance,
              "bm25": bm25,
              "colbert": colbert
            }
            pos = pos + 1
            if relevant:
              positive_pairs.append(doc)
            else:
              negative_pairs.append(doc)
        if len(positive_pairs) > 0:
          responses.append(positive_pairs[0])
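          # Sample two distinct negatives (fewer if fewer are available);
          # np.random.choice samples with replacement unless replace=False
          num_negatives = min(2, len(negative_pairs))
          for n in np.random.choice(negative_pairs, size=num_negatives, replace=False):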
            responses.append(n)
          
    else:
      # Print the raw body: an error response is not guaranteed to be valid JSON
      print("query request failed with " + response.text)

In [32]:
import json
import requests
from requests.adapters import HTTPAdapter, Retry

In [33]:
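# Shared HTTP session used by search(), with retries on transient failures.
# Note: by default urllib3 applies status_forcelist only to idempotent methods,
# so POST requests are retried on connection errors but not on these status
# codes unless allowed_methods is set on Retry.
session = requests.Session()
retries = Retry(total=20, connect=20,
      backoff_factor=0.3,
      status_forcelist=[ 500, 503, 504, 429 ]
)
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))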


In [34]:
# Collects the positive row and the sampled negative rows appended by search()
responses = []

In [35]:
queries.apply(search,axis=1)

0        None
1        None
2        None
3        None
4        None
         ... 
33095    None
33096    None
33097    None
33098    None
33099    None
Length: 33100, dtype: object

In [36]:
df_result = pd.DataFrame.from_records(responses)

In [37]:
df_result

Unnamed: 0,query,doc_id,relevant,title,abstract,score,bm25,colbert
0,what is the name of the virus that caused resp...,5p68npdb,True,Influenza A (H10N7) Virus Causes Respiratory T...,Avian influenza viruses sporadically cross the...,0.857566,76.120982,72.329134
1,what is the name of the virus that caused resp...,9x2z2hg1,False,Human bocavirus infection as a cause of severe...,Abstract In 2005 human bocavirus (HBoV) was di...,0.440396,26.244967,59.373533
2,what is the name of the virus that caused resp...,0qdjk7e0,False,Respiratory syncytial virus and human rhinovir...,Respiratory infections are very common in Kuwa...,0.442540,28.187150,57.838307
3,what tyrosine kinases are involved in dengue v...,y5gmlsi1,True,Identification and characterization of the rol...,We screened a siRNA library targeting human ty...,0.925027,53.349261,82.207237
4,what tyrosine kinases are involved in dengue v...,1cd27vgj,False,Intrahost selection pressures drive rapid deng...,"Dengue, caused by four dengue virus serotypes ...",0.494465,22.794240,58.499635
...,...,...,...,...,...,...,...,...
42463,what is the vascular integrity of the arterial...,d8pf9s3k,False,Influence of Thrombus Composition on Thrombect...,"PURPOSE A first-pass, direct aspiration techni...",0.599344,43.753466,59.496343
42464,what is the vascular integrity of the arterial...,42vglg9p,False,Fibrin Clot Architecture in Acute Ischemic Str...,BACKGROUND The composition of intra-arterial c...,0.615178,43.899136,61.735330
42465,what is the learning curve of a young surgeon'...,hje6lzip,True,Learning Curve of a Young Surgeon's Video-assi...,BACKGROUND The purpose of this paper is to pre...,0.944653,109.730832,77.457502
42466,what is the learning curve of a young surgeon'...,g0clai34,False,Video-assisted thoracic surgery right sleeve l...,A 50-year-old active male with a smoking histo...,0.553823,52.099838,57.593029


In [38]:
df_result.to_parquet("train_data_k1.parquet")