<a target="_blank" href="https://colab.research.google.com/github/shaankhosla/semanticsearch/blob/main/notebooks/Mistral_7b_rag.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Modified from the Qdrant RAG [example](https://colab.research.google.com/github/qdrant/examples/blob/master/rag-openai-qdrant/rag-openai-qdrant.ipynb)


In [1]:
%%capture
import locale
locale.getpreferredencoding = lambda: "UTF-8"

%pip install qdrant-client==1.5.4 fastembed==0.0.4 langchain==0.0.350
%pip install -q -U transformers==4.42.3 accelerate==0.32.1 bitsandbytes==0.43.1

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import qdrant_client
from google.colab import userdata

In [3]:
# Create an in-memory Qdrant client (no collections exist yet)

client = qdrant_client.QdrantClient(":memory:")
client.get_collections()

CollectionsResponse(collections=[])

In [4]:
# Add some sentences; client.add embeds them with FastEmbed and creates the collection if needed

client.add(
    collection_name="knowledge-base",
    documents=[
        "Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!",
        "Docker helps developers build, share, and run applications anywhere — without tedious environment configuration or management.",
        "PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.",
        "MySQL is an open-source relational database management system (RDBMS). A relational database organizes data into one or more data tables in which data may be related to each other; these relations help structure the data. SQL is a language that programmers use to create, modify and extract data from the relational database, as well as control user access to the database.",
        "NGINX is a free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.",
        "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.",
        "SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining.",
        "The cron command-line utility is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs (commands or shell scripts), also known as cron jobs, to run periodically at fixed times, dates, or intervals.",
    ],
)

['96ec93292087409ba33496dbd8684fe8',
 '60dee5ab80f54d67970e279633cd5eaa',
 'c6157bef41d549f8b69eca5a854f7cc2',
 '12c00104fd9e4f1c918feee849a33e41',
 '17c8b2ff792e48a18f83e550f6054bbd',
 '690917b764794c168d3c167c47d0ad79',
 '1ad013a0b01a459188aaf07496a53cdd',
 '1b8c178d73b94cf797eb41e1e80093dc']

In [7]:
# Load in Mistral model and tokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="float16",
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=userdata.get("huggingface"), trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    token=userdata.get("huggingface"),
)

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

In [10]:
# Generate text with no external knowledge


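def text_generate(prompt):
    """Wrap the prompt in [INST] tags, sample a completion, and return the decoded strings."""
    prompt = "[INST] " + prompt + " [/INST]"
    # Only input_ids are passed on, so generate() receives no attention mask
    inputs_not_chat = tokenizer.encode_plus(prompt, return_tensors="pt")[
        "input_ids"
    ].to("cuda")

    # do_sample=True means the output varies from run to run
    generated_ids = model.generate(inputs_not_chat, max_new_tokens=1000, do_sample=True)
    # batch_decode returns a list of strings, each holding the prompt followed by the completion
    decoded = tokenizer.batch_decode(generated_ids)
    return decoded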


prompt = """What tools should I need to use to build a web service using vector embeddings for search?"""
text_generate(prompt)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['<s> [INST] What tools should I need to use to build a web service using vector embeddings for search? [/INST] To build a web service using vector embeddings for search, you will need the following tools:\n1. Vector space library: You will need a library for working with vector spaces, such as `numpy`, `scipy`, or `torch` (PyTorch) in Python.\n2. Text preprocessing library: You will need a library for text preprocessing, such as `nltk`, `spaCy`, or `gensim` in Python.\n3. Embedding model: You will need a library for training and using pre-trained or custom embeddings, such as `word2vec`, `gloVe`, or `BERT` (PyTorch).\n4. Similarity search algorithm: You will need an algorithm for finding similar documents, such as the Cosine similarity or Jaccard similarity.\n5. Search engine: You can use a search engine library like `elasticsearch`, `solr`, or `whoosh` to deploy your web search service. You can also use a cloud-based search engine like Amazon Elasticsearch or Google Cloud Search.\n6.
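The warning printed before the output above appears because `text_generate` hands only `input_ids` to `model.generate`. Below is a minimal sketch of one way to avoid it, assuming the tokenizer and model loaded earlier: call the tokenizer directly so it also returns an `attention_mask`, and pass `pad_token_id` explicitly. The name `generate_with_mask` is hypothetical, and the function takes the model and tokenizer as arguments so that it does not rely on notebook globals.

```python
def generate_with_mask(model, tokenizer, prompt, max_new_tokens=1000):
    """Generate text while passing the attention mask and pad token id explicitly."""
    wrapped = "[INST] " + prompt + " [/INST]"
    # Calling the tokenizer returns a mapping with input_ids and attention_mask
    inputs = tokenizer(wrapped, return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **inputs,  # forwards both input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.batch_decode(generated_ids)
```

In the notebook this would be called as `generate_with_mask(model, tokenizer, prompt)`.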

In [11]:
# Use question to query vector DB
results = client.query(
    collection_name="knowledge-base",
    query_text=prompt,
    limit=3,
)
for r in results:
    print(r.document, "\n")

context = "\n".join([r.document for r in results])

Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more! 

FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints. 

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining. 
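Conceptually, the query above embeds the question and returns the stored documents whose vectors lie closest to it. The toy sketch below illustrates that ranking step in plain Python. It assumes cosine similarity as the metric, and the three-dimensional vectors are made up for illustration; they are not real embeddings from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k document names ranked by descending cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Made-up vectors standing in for document embeddings
toy_vectors = {
    "qdrant": [0.9, 0.1, 0.0],
    "fastapi": [0.6, 0.6, 0.1],
    "cron": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy_vectors, k=2))
```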



In [12]:
# Use semantic search results in prompt

metaprompt = f"""
You are a software architect.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: {prompt.strip()}

Context:
{context.strip()}

Answer:
"""

# Look at the full metaprompt
print(metaprompt)


You are a software architect.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: What tools should I need to use to build a web service using vector embeddings for search?

Context:
Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!
FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.
SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences wi

In [13]:
text_generate(metaprompt)


['<s> [INST] \nYou are a software architect.\nAnswer the following question using the provided context.\nIf you can\'t find the answer, do not pretend you know it, but answer "I don\'t know".\n\nQuestion: What tools should I need to use to build a web service using vector embeddings for search?\n\nContext:\nQdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!\nFastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.\nSentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-simila
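The returned string repeats the whole metaprompt before the model's answer. A small helper can keep only the part after the closing `[/INST]` tag. This is a sketch that assumes the decoded text follows the `<s> [INST] ... [/INST] answer` layout seen in the outputs; the name `extract_answer` is hypothetical.

```python
def extract_answer(decoded: str) -> str:
    """Return only the text generated after the final [/INST] marker."""
    # rpartition yields ("", "", decoded) when the marker is missing,
    # so the whole string is kept in that case
    _, _, answer = decoded.rpartition("[/INST]")
    # Drop the end-of-sequence token if the model emitted one
    return answer.replace("</s>", "").strip()
```

In the notebook it could be applied as `extract_answer(text_generate(metaprompt)[0])`.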

In [14]:
def rag(question: str, n_points: int = 3) -> list[str]:
    """Retrieve the n_points closest documents and ask the model to answer from them."""
    results = client.query(
        collection_name="knowledge-base",
        query_text=question,
        limit=n_points,
    )

    context = "\n".join(r.document for r in results)

    metaprompt = f"""
    You are a software architect.
    Answer the following question using the provided context.
    If you can't find the answer, do not pretend you know it, but answer "I don't know".

    Question: {question.strip()}

    Context:
    {context.strip()}

    Answer:
    """

    return text_generate(metaprompt)
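Because the f-string in `rag` is indented along with the function body, every line of the prompt starts with four spaces, which shows up as `\n    ` in the decoded outputs. Whether this affects generation is not established here. If it is unwanted, one option is to assemble the prompt from a list of lines. The helper name `build_prompt` is hypothetical; the wording of the prompt is copied from the notebook.

```python
def build_prompt(question: str, documents: list[str]) -> str:
    """Build the RAG prompt without leading indentation on any line."""
    context = "\n".join(doc.strip() for doc in documents)
    lines = [
        "You are a software architect.",
        "Answer the following question using the provided context.",
        "If you can't find the answer, do not pretend you know it, but answer \"I don't know\".",
        "",
        f"Question: {question.strip()}",
        "",
        "Context:",
        context,
        "",
        "Answer:",
    ]
    return "\n".join(lines)
```

Inside `rag`, the f-string could then be replaced with `build_prompt(question, [r.document for r in results])`.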

With retrieval and prompting wrapped in `rag`, each new question is a single call.


In [15]:
rag("What can the stack for a web api look like?")


['<s> [INST] \n    You are a software architect.\n    Answer the following question using the provided context.\n    If you can\'t find the answer, do not pretend you know it, but answer "I don\'t know".\n\n    Question: What can the stack for a web api look like?\n\n    Context:\n    FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.\nNGINX is a free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.\nDocker helps developers build, share, and run applications anywhere — without tedious environment configuration or management.\n\n    Answer:\n     [/INST] The stack for a FastAPI web API using NGINX as a reverse proxy and Docker for containerization can look like this:\n\n* Front-end: JavaScript framework/library (e.g. React, Angular)\n* B

In [16]:
rag("Where is the nearest grocery store?")


['<s> [INST] \n    You are a software architect.\n    Answer the following question using the provided context.\n    If you can\'t find the answer, do not pretend you know it, but answer "I don\'t know".\n\n    Question: Where is the nearest grocery store?\n\n    Context:\n    Docker helps developers build, share, and run applications anywhere — without tedious environment configuration or management.\nQdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!\nMySQL is an open-source relational database management system (RDBMS). A relational database organizes data into one or more data tables in which data may be related to each other; these relations help structure the data. SQL is a language that programmers use to create, modify and extract d
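For an off-topic question like this one, the search still returns its nearest neighbours, so unrelated documents (Docker, Qdrant, MySQL) end up in the context. One possible guard is to drop results below a similarity threshold before building the prompt. The sketch below assumes each result exposes a `score` next to `document`; the `Hit` class is a stand-in for the real result type, and the default threshold of 0.5 is an arbitrary placeholder that would need tuning.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Stand-in for a search result; only the two fields used here."""
    document: str
    score: float


def filter_hits(hits, min_score: float):
    """Keep only the hits whose similarity score reaches the threshold."""
    return [h for h in hits if h.score >= min_score]


def context_or_none(hits, min_score: float = 0.5):
    """Join the relevant documents, or return None when nothing is relevant."""
    relevant = filter_hits(hits, min_score)
    if not relevant:
        return None
    return "\n".join(h.document for h in relevant)
```

When `context_or_none` returns `None`, the caller could skip generation and reply "I don't know" directly.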