When a vector store is large, it can be hard for the end user to phrase a query that retrieves the most relevant results from a similarity search.

Instead of worrying about the exact phrasing of the query, we can use a multi-query retriever: an LLM generates several versions of the query, each version is run against the vector store, and the combined results are usually better than those from a single query.
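
The idea can be sketched in plain Python before bringing in LangChain. This is a minimal illustration, not the library's implementation: `multi_query_retrieve`, `toy_variants`, `toy_search`, and `TOY_INDEX` are hypothetical names, the keyword lookup stands in for embedding similarity, and the hard-coded variants stand in for what an LLM would generate. Including the original question in the search is a choice made for this sketch.

```python
from typing import Callable, Dict, List

# Hypothetical toy "index": keyword -> chunk ids. A real setup would use a
# vector store and embedding similarity instead of keyword lookup.
TOY_INDEX: Dict[str, List[str]] = {
    "special": ["chunk-1"],
    "notable": ["chunk-1", "chunk-2"],
    "compare": ["chunk-3"],
}


def toy_search(query: str) -> List[str]:
    """Return the chunks whose keyword appears in the query."""
    words = set(query.lower().rstrip("?").split())
    hits: List[str] = []
    for keyword, chunks in TOY_INDEX.items():
        if keyword in words:
            hits.extend(chunks)
    return hits


def toy_variants(question: str) -> List[str]:
    """Stand-in for the LLM call that rephrases the question."""
    return ["What makes LLaMA notable?", "How does LLaMA compare to GPT-3?"]


def multi_query_retrieve(
    question: str,
    generate_variants: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Search with the question and each variant; return the de-duplicated union."""
    seen = set()
    merged: List[str] = []
    for q in [question, *generate_variants(question)]:
        for doc in search(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


question = "what is so special about llama 2?"
single = toy_search(question)
merged = multi_query_retrieve(question, toy_variants, toy_search)
print("single query:", single)
print("multi query: ", merged)
```

In this toy setup the single query only reaches one chunk, while the rephrased variants pull in chunks that the original wording misses.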


In [1]:
llama2_paper_path = './LLaMA_Open_and_Efficient_Foundation_Language_Models.pdf'


#### Load documents and Retriever Setup

In [2]:
# import the LangChain pdf document loader
from langchain.document_loaders import PyPDFLoader


In [3]:
# Load and create pages
loader = PyPDFLoader(file_path=llama2_paper_path)
pages = loader.load_and_split()


In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader(file_path=llama2_paper_path)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
documents = text_splitter.split_documents(documents)


In [5]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')


In [6]:
from langchain.vectorstores import Chroma

db = Chroma.from_documents(documents,
                           embedding=embeddings)


#### Prepare Prompt Template

In [None]:
from langchain import PromptTemplate

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

# Default system prompt
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, 
while being safe. Your answers should not include any harmful, unethical, racist, sexist, 
toxic, dangerous, or illegal content. Please ensure that your responses are socially 
unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of 
answering something not correct. If you don't know the answer to a question, please don't 
share false information.

Always say "thanks for asking!" at the end of the answer. """

def get_prompt_template(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT):
    # Wrap the system prompt and the instruction in the [INST] / <<SYS>> markers.
    # Local names are lowercase so they do not shadow the imported PromptTemplate class.
    system_prompt = B_SYS + new_system_prompt + E_SYS
    prompt_template = B_INST + system_prompt + instruction + E_INST

    return prompt_template

instruction = '''Use the following pieces of context to answer the question at the end. 
{context}
Question: {question}
Helpful Answer:'''

template = get_prompt_template(instruction)

prompt = PromptTemplate(template=template,
                        input_variables=["context", "question"])
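
To see what such a template expands to, here is a small plain-Python sketch that fills the placeholders with `str.format` instead of LangChain's `PromptTemplate`. The helper name `build_template`, the shortened system prompt, and the sample context and question are assumptions made for illustration only.

```python
# Same markers as in the cell above.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_template(instruction: str, system_prompt: str) -> str:
    """Wrap a system prompt and an instruction in the instruction markers."""
    return B_INST + B_SYS + system_prompt + E_SYS + instruction + E_INST


# Shortened, hypothetical system prompt and sample inputs.
template = build_template(
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:",
    "You are a helpful assistant.",
)

rendered = template.format(
    context="LLaMA-13B outperforms GPT-3 on most benchmarks.",
    question="How does LLaMA-13B compare to GPT-3?",
)
print(rendered)
```

Printing the rendered string is a quick way to check that the markers, the system prompt, and both placeholders end up where expected before the template is passed to a chain.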


#### LLM and HF Pipeline

In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from transformers import GPTQConfig
from langchain import HuggingFacePipeline

mname = "TheBloke/Mistral-7B-OpenOrca-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(mname)
tokenizer.pad_token = tokenizer.eos_token

quantization_config_loading = GPTQConfig(bits=4, 
                                         disable_exllama=True, 
                                         use_cuda_fp16=True,
                                         tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(mname,
                                             quantization_config=quantization_config_loading,
                                             device_map="auto")

model.eval()

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                torch_dtype=torch.bfloat16,
                device_map="auto",
                max_new_tokens=256,
                do_sample=True,
                top_k=1,
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id,
                repetition_penalty=1.2
                )

llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature': 0})


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.
You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute and has already quantized weights. However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
CUDA extension not installed.


In [8]:
from langchain.globals import set_verbose

set_verbose(True)

from langchain.globals import set_debug

set_debug(True)

# Set logging for the queries
import logging
logging.basicConfig()
logging.getLogger('langchain.retrievers.multi_query').setLevel(logging.INFO)


#### Multi-Query Retriever

In [9]:
from langchain.retrievers.multi_query import MultiQueryRetriever

mq_retriever = MultiQueryRetriever.from_llm(llm=llm,
                                            retriever=db.as_retriever())
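
The retriever's prompt asks the LLM to return the alternative questions separated by newlines, so the raw generation has to be split into individual queries before each one is searched. The following is a minimal sketch of that step using a hypothetical helper, `parse_queries`; it is not LangChain's own parser. The sample text is the raw generation from this notebook's run.

```python
from typing import List


def parse_queries(generated_text: str) -> List[str]:
    """Split raw LLM output on newlines, strip whitespace, and drop empty lines."""
    return [line.strip() for line in generated_text.splitlines() if line.strip()]


# Raw generation as produced in this notebook's run (note the leading
# newlines and indentation).
raw = (
    "\n    \n    What makes Llama 2 unique or notable in comparison to other models?"
    "\n    Why has Llama 2 gained attention and popularity among researchers and users alike?"
    "\n    How does Llama 2 differ from previous language models in terms of its capabilities and performance?"
)

queries = parse_queries(raw)
for q in queries:
    print(repr(q))
```

Stripping each line keeps stray indentation from the model out of the query strings that are embedded and searched.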


In [10]:
query = "what is so special about llama 2?"

output = mq_retriever.get_relevant_documents(query=query)


[chain/start] [1:retriever:Retriever > 2:chain:LLMChain] Entering Chain run with input:
{
  "question": "what is so special about llama 2?"
}
[llm/start] [1:retriever:Retriever > 2:chain:LLMChain > 3:llm:HuggingFacePipeline] Entering LLM run with input:
{
  "prompts": [
    "You are an AI language model assistant. Your task is \n    to generate 3 different versions of the given user \n    question to retrieve relevant documents from a vector  database. \n    By generating multiple perspectives on the user question, \n    your goal is to help the user overcome some of the limitations \n    of distance-based similarity search. Provide these alternative \n    questions separated by newlines. Original question: what is so special about llama 2?"
  ]
}


INFO:langchain.retrievers.multi_query:Generated queries: ['What makes Llama 2 unique or notable in comparison to other models?', '    Why has Llama 2 gained attention and popularity among researchers and users alike?', '    How does Llama 2 differ from previous language models in terms of its capabilities and performance?']


[llm/end] [1:retriever:Retriever > 2:chain:LLMChain > 3:llm:HuggingFacePipeline] [71.88s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "\n    \n    What makes Llama 2 unique or notable in comparison to other models?\n    Why has Llama 2 gained attention and popularity among researchers and users alike?\n    How does Llama 2 differ from previous language models in terms of its capabilities and performance?",
        "generation_info": null,
        "type": "Generation"
      }
    ]
  ],
  "llm_output": null,
  "run": null
}
[chain/end] [1:retriever:Retriever > 2:chain:LLMChain] [71.88s] Exiting Chain run with output:
[outputs]


In [11]:
type(output)


list

In [12]:
output


[Document(page_content='els that have not been ﬁnetuned on code, namely\nPaLM and LaMDA (Thoppilan et al., 2022). PaLM\nand LLaMA were trained on datasets that contain\na similar number of code tokens.\nAs show in Table 8, for a similar number\nof parameters, LLaMA outperforms other gen-\neral models such as LaMDA and PaLM, which\nare not trained or ﬁnetuned speciﬁcally for code.\nLLaMA with 13B parameters and more outper-\nforms LaMDA 137B on both HumanEval and\nMBPP. LLaMA 65B also outperforms PaLM 62B,', metadata={'page': 5, 'source': './LLaMA_Open_and_Efficient_Foundation_Language_Models.pdf'}),
 Document(page_content='LLaMA-13B outperforms GPT-3 on most bench-\nmarks, despite being 10 ×smaller. We believe that\nthis model will help democratize the access and\nstudy of LLMs, since it can be run on a single GPU.\nAt the higher-end of the scale, our 65B-parameter\nmodel is also competitive with the best large lan-\nguage models such as Chinchilla or PaLM-540B.\nUnlike Chinchilla, PaL