# Semantic search with natural language output
The corpus is split into chunks and each chunk is embedded. At query time the search query is embedded the same way and compared against the chunk embeddings. The top 3 results are passed to an LLM, which summarizes them into a single natural-language answer for the user.

- Embedding model = Instructor-large
- Vector db = Weaviate
- LLM = None / OpenAI
- Bringing it all together = LangChain

Ref tutorial: https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/weaviate

Weaviate: https://console.weaviate.cloud/dashboard

In [None]:
%conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia
# CPU-only alternative:
# %conda install pytorch cpuonly -c pytorch
%pip install transformers==4.30.2
%pip install langchain==0.0.207
%pip install weaviate-client==3.21.0
%pip install openai==0.27.8
%pip install InstructorEmbedding==1.0.1
%pip install sentence_transformers==2.2.2
%pip install python-dotenv

In [12]:
import os
import dotenv

# Load environment variables (such as OPENAI_API_KEY) from a local .env file.
dotenv.load_dotenv()
weaveiate_url = 'https://poc-cluster-gn6v0ngr.weaviate.network'

In [15]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings

# Load the source document and split it into chunks of roughly 1000 characters.
loader = TextLoader("../src_docs/Locale.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Instructor-large embedding model; the weights are cached under ../models.
embeddings = HuggingFaceInstructEmbeddings(
    query_instruction="Represent the query for retrieval: ",
    model_name='hkunlp/instructor-large',
    cache_folder='../models'
)

Created a chunk of size 1071, which is longer than the specified 1000
Created a chunk of size 1325, which is longer than the specified 1000
Created a chunk of size 1204, which is longer than the specified 1000
Created a chunk of size 1559, which is longer than the specified 1000
Created a chunk of size 1482, which is longer than the specified 1000
Created a chunk of size 1550, which is longer than the specified 1000
Created a chunk of size 1067, which is longer than the specified 1000
Created a chunk of size 2709, which is longer than the specified 1000
Created a chunk of size 1747, which is longer than the specified 1000
Created a chunk of size 1278, which is longer than the specified 1000
Created a chunk of size 1438, which is longer than the specified 1000
Created a chunk of size 1819, which is longer than the specified 1000


load INSTRUCTOR_Transformer
max_seq_length  512
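
The `Created a chunk of size ...` warnings above appear because `CharacterTextSplitter` only cuts at its separator (`"\n\n"` by default), so a paragraph longer than `chunk_size` cannot be broken up further. The sketch below is a simplified, hypothetical re-implementation of that behaviour in plain Python, not LangChain's actual code; the function name `split_on_separator` and the sample text are invented for illustration.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece that is already longer than
    chunk_size is emitted unchanged, which is what triggers the warning."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample: one 1500-character paragraph between two short ones.
sample = "short paragraph" + "\n\n" + "x" * 1500 + "\n\n" + "another short one"
chunk_lengths = [len(c) for c in split_on_separator(sample)]
print(chunk_lengths)
```

If oversized chunks are a problem, `RecursiveCharacterTextSplitter` from `langchain.text_splitter` falls back to finer separators and may be worth trying instead.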


In [16]:
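from langchain.vectorstores import Weaviate

# Embed every chunk locally and upload it to the Weaviate cluster.
# by_text=False makes searches use the locally computed query vector.
db = Weaviate.from_documents(docs, embeddings, weaviate_url=weaveiate_url, by_text=False)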

In [31]:
query = "What is their target audience?"
docs = db.similarity_search(query)
docs[0].page_content

'* Product/ BD teams: Consume the dashboard for strategic company decisions\n* Supply/ Ops teams: Creating triggers based on certain events\n* Marketing teams: Integrating it with any promotions product they already use such as Clevertap\n* CXOs: A daily image for progress\n* Data Scientists: Can export the metrics, profiles and use it in their models\n\n\n4. Models: \n\n\nUser personas: Data Scientists who can customize some of the pre-built models and fetch the results via an API.\n\n\nA Usual Scenario:'
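
Conceptually, the similarity search above ranks chunks by how close their embedding is to the query embedding. The sketch below shows that idea as a brute-force ranking in plain Python, assuming cosine similarity as the metric. The toy 2-dimensional vectors and the helper names `cosine_similarity` and `top_k` are made up for illustration; a vector database such as Weaviate uses an approximate index rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy embeddings standing in for real Instructor-large vectors.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [0.9, 0.1]
ranked = top_k(toy_query, toy_chunks, k=3)
print(ranked)
```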

In [32]:
docs = db.similarity_search_with_score(query, by_text=False)
docs

[(Document(page_content='* Product/ BD teams: Consume the dashboard for strategic company decisions\n* Supply/ Ops teams: Creating triggers based on certain events\n* Marketing teams: Integrating it with any promotions product they already use such as Clevertap\n* CXOs: A daily image for progress\n* Data Scientists: Can export the metrics, profiles and use it in their models\n\n\n4. Models: \n\n\nUser personas: Data Scientists who can customize some of the pre-built models and fetch the results via an API.\n\n\nA Usual Scenario:', metadata={'_additional': {'vector': [-0.036728065, 0.0045636753, 0.015213454, -0.0034275155, 0.062997796, 0.030679079, -0.008486036, -0.007045085, -0.012307138, 0.05339873, 0.052098054, -0.0036522415, 0.016076673, 0.030469503, -0.056089234, -0.049058758, -0.03110545, -0.0070288805, -0.06756276, 0.0025541952, 0.02433158, -0.009342339, -0.00854825, -0.0051109693, -0.018910272, -0.0033044037, -0.0012395792, 0.03518661, 0.033290416, -0.072650805, 0.03274688, -0.0

In [24]:
retriever = db.as_retriever(search_type="mmr")
retriever.get_relevant_documents(query)

[Document(page_content='But what about when we have most of our operations in the physical world? In any business where you have an on-ground presence, we have our users on one side and our resources (cycles, delivery people, cars) on the other side. \n\n\nOur ultimate aim in life is to get to your users as fast as possible by making the best use of your resources. This mammoth task of matching our supply with demand comes with its own bundle of challenges because the real world is far more chaotic and fickle.\n\n\nHow does the funnel of a delivery person look like? Which lap does he spend the maximum in time and in which locations? If he is idle, where should he move?\n\n\nThe problem is a BIG one.', metadata={'source': '../src_docs/Locale.txt'}),
 Document(page_content='Ingestion server - \nLogs\n                     1. Timestamp of requests sent to the database\n                     2. Status of the request - error/success\n                     3. Processing time for request to be w
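
The `mmr` retriever returns different chunks from the plain similarity search because maximal marginal relevance trades relevance to the query against redundancy with chunks that were already picked. The following is a minimal sketch of that selection rule in plain Python, assuming cosine similarity and a weighting parameter named `lambda_mult`; the toy vectors are hypothetical, and LangChain's own implementation may differ in its details.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance selection.

    lambda_mult=1.0 ranks purely by relevance; lower values penalize
    documents that resemble ones already selected."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 is less relevant but distinct.
toy_docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
toy_query = [1.0, 0.0]
by_relevance = mmr(toy_query, toy_docs, k=2, lambda_mult=1.0)
diverse = mmr(toy_query, toy_docs, k=2, lambda_mult=0.3)
print(by_relevance, diverse)
```

With `lambda_mult=1.0` the near-duplicate is chosen second; with a lower value the distinct document is preferred.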

In [14]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain import OpenAI

# chain_type="stuff" places all retrieved chunks into a single prompt.
# OpenAI() reads its key from the OPENAI_API_KEY environment variable.
chain = RetrievalQAWithSourcesChain.from_chain_type(
    OpenAI(temperature=0), chain_type="stuff", retriever=db.as_retriever()
)
chain(
    {"question": "Who is Sherlock's best friend?"},
    return_only_outputs=True,
)

{'answer': " Sherlock Holmes' best friend is Dr. John Watson.\n",
 'sources': '../src_docs/sherlockholmes.txt'}
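
To make the `"stuff"` chain type more concrete, the sketch below builds a single prompt from a list of retrieved chunks and their sources. It is an illustration only: the function `build_stuff_prompt`, the prompt wording, and the placeholder chunks are all hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, retrieved):
    """Concatenate every retrieved (text, source) pair into one prompt.

    The wording here is an assumed, simplified template."""
    context = "\n\n".join(
        f"Content: {text}\nSource: {source}" for text, source in retrieved
    )
    return (
        "Answer the question using only the extracts below "
        "and list the sources you used.\n\n"
        f"{context}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )


# Placeholder chunks standing in for the retriever's output.
retrieved = [
    ("first retrieved chunk", "doc_a.txt"),
    ("second retrieved chunk", "doc_b.txt"),
]
prompt = build_stuff_prompt("What is their target audience?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window.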