### Evaluation of the TF-IDF ranking implemented in MongoDB for probabilistic full-text search

We configured a MongoDB instance in a Docker container with 16 GB of RAM and 4 cores, forwarding port 27017 to the host machine. We built a text index over the `content` field of the 23.9 million documents for the TF-IDF ranking.
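The index creation itself is not shown in this notebook. A minimal sketch of how such a text index could be created with PyMongo, assuming the database, collection, and field names used in the cells below (`PubMed`, `Docs`, `content`); the helper name `create_content_text_index` is hypothetical:

```python
def create_content_text_index(collection, field="content"):
    """Create a text index on `field` and return the name of the index.

    Hypothetical helper: the notebook does not show how its index was built.
    `collection` is expected to be a pymongo Collection; the string "text"
    is the index type that pymongo also exposes as the constant pymongo.TEXT.
    """
    return collection.create_index([(field, "text")])


# Usage (assumes a running MongoDB instance as described above):
# from pymongo import MongoClient
# collection = MongoClient("localhost", 27017)["PubMed"]["Docs"]
# create_content_text_index(collection)
```

On a collection of this size, building the index is likely to take a long time, so it would normally be run once rather than in every notebook session.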

In [1]:
from pymongo import MongoClient
import numpy as np
from pathlib import Path
import os
import json
from tqdm import tqdm

# Connect to MongoDB
client = MongoClient('localhost', 27017)
db = client['PubMed']
collection = db['Docs']

collections = db.list_collection_names()
print("Collections in the database:", collections)

Collections in the database: ['Docs', 'all_docs']


Now we define the search query and the number of results to retrieve. We retrieve only the PMIDs of the documents so that the results can be compared with the documents marked as relevant for each query in the BioASQ dataset.
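The notebook does not state which metric is used for this comparison. As one possible sketch, precision and recall at a cutoff `k` could be computed from the retrieved PMIDs and the relevant PMIDs of a query. The function name and the example PMIDs below are illustrative assumptions, not values from the BioASQ dataset:

```python
def precision_recall_at_k(retrieved_pmids, relevant_pmids, k):
    """Return (precision@k, recall@k) for a single query.

    retrieved_pmids: PMIDs returned by the search, best match first.
    relevant_pmids: PMIDs marked as relevant for the query.
    """
    top_k = list(retrieved_pmids)[:k]
    relevant = set(relevant_pmids)
    # Count how many of the top-k results are marked as relevant.
    hits = sum(1 for pmid in top_k if pmid in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up PMIDs for illustration: 2 of the 4 retrieved documents are relevant,
# and 2 of the 3 relevant documents were found.
precision, recall = precision_recall_at_k([1, 2, 3, 4], [2, 4, 9], k=4)
print(precision, recall)
```

Averaging these values over all queries gives one number per metric for the whole ranking approach.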

The following function retrieves up to 100 documents whose `content` field matches the search query. The query only filters for matching documents and applies no ranking, so the results are not sorted by relevance.

In [10]:
def search(query):
    results = collection.find({"$text": {"$search": query}}).limit(100)
    return results

In [14]:
pmid_liste = search("Is it possible to visualize subtahalamic nucleus by using transcranial ultrasound?")

list(pmid_liste)[:5]

[{'_id': ObjectId('6617aff5fb3d2cdc7cd9b435'),
  'id': 'pubmed23n0045_2126',
  'title': 'Distribution of somatostatin-28 (1-12) in the cat brainstem: an immunocytochemical study.',
  'content': 'We studied the distribution of somatostatin-28 (1-12)-immunoreactive fibers and cell bodies in the cat brainstem. A moderate density of cell bodies containing the peptide was observed in the ventral nucleus of the lateral lemniscus, accessory dorsal tegmental nucleus, retrofacial nucleus and in the lateral reticular nucleus, whereas a low density of such perikarya was found in the interpeduncular nucleus, nucleus incertus, nucleus sagulum, gigantocellular tegmental field, nucleus of the trapezoid body, nucleus praepositus hypoglosii, lateral and magnocellular tegmental fields, nucleus of the solitary tract, nucleus ambiguous and in the nucleus intercalatus. Moreover, a moderate density of somatostatin-28 (1-12)-immunoreactive processes was found in the dorsal nucleus of the raphe, dorsal tegmen

The following function retrieves the PMIDs of the `k` documents that best match the search query in the `content` field. The results are sorted by the TF-IDF ranking, which the query reads from the `textScore` metadata.

Retrieval takes significantly longer than the unranked query because every matching document has to be scored and sorted before the top `k` can be returned.

In [2]:
def search_TF_IDF(query, k):
    # Return only the PMID and text score of the k best-scoring documents.
    projection = {"_id": 0, "PMID": 1, "score": {"$meta": "textScore"}}
    results = collection.find({"$text": {"$search": query}}, projection)
    return results.sort([("score", {"$meta": "textScore"})]).limit(k)

PyMongo cursors are evaluated lazily: `find()` only builds the cursor, and the query is sent to the server when the results are first accessed. We therefore access the results explicitly to measure how long retrieval takes.

In [3]:
pmid_liste = search_TF_IDF("Is it possible to visualize subtahalamic nucleus by using transcranial ultrasound?", 10)

The results are retrieved as a cursor. We convert the cursor to a list to access the results. This takes a while because the results are sorted by the TF-IDF ranking.

In [4]:
pmid_liste = list(pmid_liste)
pmid_liste

[{'PMID': 1627439, 'score': 3.31173513986014},
 {'PMID': 64478, 'score': 3.2818834459459465},
 {'PMID': 1705058, 'score': 3.2310779816513757},
 {'PMID': 6763082, 'score': 3.229017857142857},
 {'PMID': 2683312, 'score': 3.1279296875},
 {'PMID': 1650433, 'score': 3.095602766798419},
 {'PMID': 2473416, 'score': 3.095472440944882},
 {'PMID': 3473897, 'score': 3.07967032967033},
 {'PMID': 1519071, 'score': 3.0697115384615383},
 {'PMID': 3545257, 'score': 3.0483333333333333}]

Retrieving the results for this query takes 25 seconds, which is impractical for a real-time search engine.
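The notebook does not show how this time was measured. A minimal sketch of one way to do it, assuming the cursor is created inside the timed region so that the lazily executed query is included in the measurement; the helper name `timed_fetch` is hypothetical:

```python
import time


def timed_fetch(make_cursor):
    """Consume a lazily evaluated cursor and measure how long that takes.

    make_cursor: zero-argument callable that returns an iterable,
    for example lambda: search_TF_IDF(query, 10).
    Returns a tuple (results, elapsed_seconds).
    """
    start = time.perf_counter()
    # Converting to a list forces the query to run and fetches all results.
    results = list(make_cursor())
    elapsed = time.perf_counter() - start
    return results, elapsed


# Usage with the function defined above (requires the MongoDB instance):
# results, seconds = timed_fetch(lambda: search_TF_IDF("transcranial ultrasound", 10))
# print(f"{len(results)} results in {seconds:.1f} s")
```

Repeating the measurement for several queries would show whether a single timing is representative, since server-side caching can make later runs faster than the first.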