# Overview: comparing lexical search with semantic search
This notebook gives a simple example of lexical search (TF-IDF) and semantic search (transformer embeddings), running one example query against a set of documents from a science/medicine forum (the `sci.med` newsgroup from the 20 Newsgroups dataset).

A query related to "managing blood sugar" returns a post:
- related to blood donors and travel based on lexical search
- related to hypoglycemia/diabetes with semantic search

This notebook also includes a widget for trying out queries with either lexical or semantic search, to build intuition for how each behaves.

In [1]:
%%capture
## capture with jupyter magic to suppress output

## install needed libraries
!pip install datasets transformers sentence_transformers nltk gradio

In [2]:
from sklearn.datasets import fetch_20newsgroups

# Load corpus
corpus = fetch_20newsgroups(categories=['sci.med'], remove=('headers', 'footers', 'quotes'))

# Example lexical search

In [3]:
# Elasticsearch is a widely used tool that has traditionally prioritized search speed by building an index based on word frequency,
# using 'BM25' similarity: a ranking function in the TF-IDF family, where words that distinguish a document can retrieve that document
# [https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html]
# This cell uses plain TF-IDF with cosine similarity as a simple stand-in for that kind of lexical scoring.
# Lexical scoring does OK in some contexts, but often fails when the distinguishing words are out of context
# or are insufficient/misleading on their own.
# Here the keyword "blood" in the query "how do I manage my blood sugar" picks a document not related to hypoglycemia/diabetes.

import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialise TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit vectorizer to corpus
features = vectorizer.fit_transform(corpus.data)

# make and transform the query
query = ["how do I manage my blood sugar"]   #<- query to enter into models
query_tfidf = vectorizer.transform(query)

# Calculate pairwise similarity with all documents in corpus
cosine_similarities = cosine_similarity(features, query_tfidf).flatten()

# display the best matching document (highest cosine similarity between doc and query)
print(corpus.data[np.argsort(cosine_similarities)[-1]])



The FDA, I believe.  Rules say no blood or blood products donations
from anyone who has been in a malarial area for 3 years.  I was a platelet
donor until my Thailand trip and my blood bank was very disappointed
to find out they couldn't use me for 3 years.

Not necessarily.  The same rules may not apply to organ donation
as to blood donation.  In fact, I'm sure they don't.



-- 
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon." 
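
To see why a single shared keyword is enough to win the lexical ranking, here is a minimal hand-rolled sketch of TF-IDF scoring with cosine similarity. It is not the `TfidfVectorizer` implementation used above: the three toy documents, the whitespace tokenizer, and the helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine`) are all assumptions made for illustration, and the smoothed IDF formula is only meant to be similar in spirit to scikit-learn's default.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()

def fit_idf(docs):
    # Smoothed inverse document frequency: terms found in fewer documents weigh more.
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(text, idf):
    # Sparse vector {term: term count * idf}; terms outside the fitted vocabulary are dropped.
    counts = Counter(term for term in tokenize(text) if term in idf)
    return {term: count * idf[term] for term, count in counts.items()}

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy corpus, not taken from sci.med.
toy_docs = [
    "blood donors who travel must wait before donating blood",
    "diet and insulin help control glucose levels in diabetes",
    "aspirin can relieve a mild headache",
]
toy_query = "how do I manage my blood sugar"

idf = fit_idf(toy_docs)
query_vec = tfidf_vector(toy_query, idf)
toy_scores = [cosine(tfidf_vector(doc, idf), query_vec) for doc in toy_docs]

# Index of the highest-scoring document.
best = max(range(len(toy_docs)), key=toy_scores.__getitem__)
print(best, toy_docs[best])
```

In this toy setup the only query word that appears in any document is "blood", so the blood-donor document is the only one with a non-zero score. The diabetes document shares no words with the query and scores zero, even though it is the more relevant one.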


# Example semantic search

In [4]:
%%capture

# While cosine similarity on keyword frequency can work OK, better results are sometimes achieved with semantic similarity,
# that is, when the representation captures the context of the language in both the query and the document.
# One model that encodes this context in a vector representation suitable for cosine similarity is mpnet.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

In [5]:
# Sentences/documents are encoded as vector representations by calling model.encode()
# If not on a GPU-enabled instance, this may take some time
embeddings = model.encode(corpus.data, show_progress_bar=True, convert_to_numpy=True)



In [6]:
# We then apply the same search algorithm to the mpnet embeddings (taking semantic context into account)
# (cosine similarity between documents and query).
# This seems to pull out more relevant documents based on synonyms and other contextually similar information.
# Here the phrase "blood sugar" now picks a document on hypoglycemia/diabetes.

query = "how do I manage my blood sugar"
query_embedding = model.encode(query, convert_to_numpy=True).reshape(1, -1)

# Calculate pairwise similarity with all documents in corpus
cosine_similarities = cosine_similarity(embeddings, query_embedding).flatten()

# display the best matching document (highest cosine similarity between doc and query)
print(corpus.data[np.argsort(cosine_similarities)[-1]])




Check out the DIABETIC mailing list -- a knowledgable, helpful, friendly,
voluminous bunch.  Send email to LISTSERV@PCCVM.BITNET, with this line
in the body:

SUBSCRIBE DIABETIC <your name here>

Also, the vote for misc.health.diabetes, a newsgroup for general discussion
of diabetes, is currently underway, and will close on 29 April.  From the
2nd CFV, posted to news.announce.newgroups, news.groups, and sci.med,
message <1q1jshINN4v1@rodan.UU.NET>:
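
Both searches above print only the single best hit. When comparing the two methods it can help to look at the top few matches instead. The sketch below shows one way to do that with the standard library; the helper name `top_k_indices` and the example scores are hypothetical and not part of the original notebook.

```python
import heapq

def top_k_indices(similarities, k=3):
    # Indices of the k highest scores, best first.
    # Works on a plain list or on a flattened 1-D array of similarities.
    return heapq.nlargest(k, range(len(similarities)), key=lambda i: similarities[i])

# Made-up similarity scores, for illustration only.
example_scores = [0.12, 0.87, 0.05, 0.43]
print(top_k_indices(example_scores, k=2))  # [1, 3]
```

In the notebook, this could be called as `top_k_indices(cosine_similarities, k=3)` after either search cell, printing `corpus.data[i]` for each returned index.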


# Example interactive search

In [8]:
import gradio as gr

## Running this cell brings up an interactive gradio app
## to experiment with queries and the output from a lexical
## or semantic search module
## assumes previous cells have been run to calculate representation of corpus

# define a function to search with either method
def search_with_mpnet_tfidf(query, model_id):

    if model_id == "mpnet":
        # search with mpnet
        query_embedding = model.encode(query, convert_to_numpy=True).reshape(1, -1)
        similarities = cosine_similarity(embeddings, query_embedding).flatten()

    elif model_id == "tf-idf":
        # search with tfidf
        query_tfidf = vectorizer.transform([query])
        similarities = cosine_similarity(features, query_tfidf).flatten()

    else:
        # no recognised method (e.g. no radio option selected): avoid an unbound result
        return "Please select a search method: tf-idf or mpnet."

    # return the document with the highest cosine similarity to the query
    return corpus.data[np.argsort(similarities)[-1]]

# define the gradio interface
search_demo = gr.Interface(fn=search_with_mpnet_tfidf, inputs=["text",gr.Radio(["tf-idf","mpnet"])], outputs=["text"])

# launch the gradio interface
search_demo.launch(debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.



Keyboard interruption in main thread... closing server.


