<a target="_blank" href="https://colab.research.google.com/github/shaankhosla/semanticsearch/blob/main/notebooks/Mistral_7b_rag.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Modified from the Qdrant RAG [example](https://colab.research.google.com/github/qdrant/examples/blob/master/rag-openai-qdrant/rag-openai-qdrant.ipynb)


In [46]:
%%capture
import locale
locale.getpreferredencoding = lambda: "UTF-8"

%pip install qdrant-client==1.5.4 fastembed==0.0.4 langchain==0.0.350
%pip install -q -U transformers==4.36.1 accelerate==0.25.0 bitsandbytes==0.41.3.post2

In [47]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline,
)
import torch
import qdrant_client

In [48]:
# Initialize Qdrant collection

client = qdrant_client.QdrantClient(":memory:")
client.get_collections()

CollectionsResponse(collections=[])

In [49]:
# Add some sentences

client.add(
    collection_name="knowledge-base",
    documents=[
        "Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!",
        "Docker helps developers build, share, and run applications anywhere — without tedious environment configuration or management.",
        "PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.",
        "MySQL is an open-source relational database management system (RDBMS). A relational database organizes data into one or more data tables in which data may be related to each other; these relations help structure the data. SQL is a language that programmers use to create, modify and extract data from the relational database, as well as control user access to the database.",
        "NGINX is a free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.",
        "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.",
        "SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining.",
        "The cron command-line utility is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs (commands or shell scripts), also known as cron jobs, to run periodically at fixed times, dates, or intervals.",
    ],
)

['14add611201b4ce09c495af7181ebc13',
 '54e5c94e20904906957c8c193262c0b0',
 '9b183aac14bc44dd881396766a0b9e6c',
 '028f7a41a2f64cd78b06787b9e3a0c58',
 '4292375dc80f4cd6befeb1a5300e555a',
 '3ff38677e1a045d19552cc6782a0ffd7',
 '65e948394dab4a7c926f1daf7230f93a',
 '97d96cc784b648b9a090f33b8531cd96']

In [50]:
# Load in Mistral model and tokenizer

model_name = "mistralai/Mistral-7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
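model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # 4-bit loading is already requested by bnb_config, so a separate
    # load_in_4bit=True argument would be redundant here.
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)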

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
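
As a rough back-of-the-envelope check on why the model is loaded in 4-bit, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count is an assumed approximate figure for Mistral 7B, `weight_memory_gib` is a hypothetical helper, and the estimate ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 7.24 billion parameters (approximate figure).
N_PARAMS = 7.24e9

bf16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"bf16 weights:  ~{bf16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the bf16 footprint, which is what makes a single Colab GPU a plausible target.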

In [51]:
# Generate text with no external knowledge


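def text_generate(prompt):
    """Sample a continuation of `prompt` and return only the newly generated text."""
    sequences = pipe(
        prompt,
        do_sample=True,
        max_new_tokens=200,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
        # Return only the continuation instead of splitting the prompt
        # back out of the full text by hand.
        return_full_text=False,
    )
    return sequences[0]["generated_text"]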


prompt = """Answer the following question.

Question: What tools should I need to use to build a web service using vector embeddings for search?"""
text_generate(prompt)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'\n\nAnswer:\n\n1. **NLP**: NLP or Natural Language Processing is a field of Artificial Intelligence that helps in understanding human language. It is a field that combines linguistics, computer science, and artificial intelligence.\n2. **Search Engine**: A search engine is a tool that enables users to search for information on the internet. It provides an interface for users to enter queries and retrieve relevant results.\n3. **Vector Embeddings**: Vector embeddings are a way of representing text as a series of numbers. This allows for more efficient processing and analysis of text.\n4. **Data Structures**: Data structures such as hash tables and graphs can be used to store and manipulate the data used in vector embeddings.\n5. **Machine Learning**: Machine learning algorithms can be used to train models on the data used in vector embeddings. This can help to improve the accuracy and performance of the search engine.\n6. **Deep'
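
The `temperature`, `top_k`, and `top_p` arguments passed to the pipeline shape how each next token is sampled. The sketch below illustrates the general idea on a toy four-token vocabulary; it is a simplified illustration, not the `transformers` implementation, and `filter_next_token_probs` and the toy logits are hypothetical.

```python
import math


def filter_next_token_probs(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return the renormalized distribution left after temperature, top-k and top-p."""
    # Temperature: divide logits before the softmax; values below 1 sharpen it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: v / total for tok, v in exps.items()}

    # Top-k: keep only the k most likely tokens, then renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    ranked = [(tok, p / k_total) for tok, p in ranked]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    p_total = sum(p for _, p in kept)
    return {tok: p / p_total for tok, p in kept}


# Toy logits, made up for illustration.
toy_logits = {"Qdrant": 2.0, "FastAPI": 1.0, "Docker": 0.0, "cron": -1.0}
filtered = filter_next_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(filtered)
```

With these toy values, top-k drops the least likely token and top-p then keeps only the two most likely ones; the sampler would draw the next token from that reduced distribution.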

In [52]:
# Use question to query vector DB

results = client.query(
    collection_name="knowledge-base",
    query_text=prompt,
    limit=3,
)
for r in results:
    print(r.document, "\n")

context = "\n".join([r.document for r in results])

Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more! 

FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints. 

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining. 
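
Conceptually, the query above embeds the question and returns the stored documents whose vectors lie closest to it. The sketch below shows that idea with hand-written three-dimensional vectors and cosine similarity; the vectors, names, and the `top_k_documents` helper are made up for illustration, and a real vector database uses learned embeddings and an approximate index rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_documents(query_vec, doc_vecs, k=3):
    """Rank documents by similarity to the query and return the k best names."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy "embeddings", invented for this example only.
toy_docs = {
    "qdrant": [0.9, 0.1, 0.0],
    "fastapi": [0.6, 0.6, 0.1],
    "cron": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.3, 0.0]

nearest = top_k_documents(toy_query, toy_docs, k=2)
print(nearest)
```

The `limit` argument in the notebook's query plays the role of `k` here.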



In [53]:
# Use semantic search results in prompt

metaprompt = f"""
You are a software architect.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: {prompt.strip()}

Context:
{context.strip()}

Answer:
"""

# Look at the full metaprompt
print(metaprompt)


You are a software architect.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: Answer the following question.

Question: What tools should I need to use to build a web service using vector embeddings for search?

Context:
Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!
FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.
SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. w

In [54]:
text_generate(metaprompt)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n\nAnswer:\n'

In [55]:
def rag(question: str, n_points: int = 3) -> str:
    results = client.query(
        collection_name="knowledge-base",
        query_text=question,
        limit=n_points,
    )

    context = "\n".join(r.document for r in results)

    metaprompt = f"""
    You are a software architect.
    Answer the following question using the provided context.
    If you can't find the answer, do not pretend you know it, but answer "I don't know".

    Question: {question.strip()}

    Context:
    {context.strip()}

    Answer:
    """

    return text_generate(metaprompt)

With retrieval and generation wrapped in `rag`, asking a new question is a single function call.


In [56]:
rag("What can the stack for a web api look like?")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'\n    - The stack for a web api can include a variety of components, including:\n    - FastAPI: A modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.\n    - NGINX: A free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.\n    - Docker: A tool that helps developers build, share, and run applications anywhere — without tedious environment configuration or management.\n    - In a typical stack for a web api, FastAPI would be used to create and manage the API, NGINX would be used as a reverse proxy to handle incoming HTTP requests, and Docker would be used to package and deploy the application.\n    '
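
The four-space indentation in the response above mirrors the prompt: the f-string inside `rag` is indented along with the function body, so every prompt line reaches the model with leading spaces. One possible way to avoid this, sketched below, is to dedent a template once and fill it in afterwards. `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and whether a cleaner prompt improves the answers is not something this notebook tests.

```python
import textwrap

# Dedent the template before formatting: retrieved context spans several
# unindented lines, so dedenting after interpolation would remove nothing.
PROMPT_TEMPLATE = textwrap.dedent(
    """\
    You are a software architect.
    Answer the following question using the provided context.
    If you can't find the answer, do not pretend you know it, but answer "I don't know".

    Question: {question}

    Context:
    {context}

    Answer:"""
)


def build_prompt(question: str, context: str) -> str:
    """Fill the dedented template with a question and its retrieved context."""
    return PROMPT_TEMPLATE.format(question=question.strip(), context=context.strip())


example_prompt = build_prompt(
    "What can the stack for a web api look like?",
    "FastAPI is a modern web framework.\nNGINX is an HTTP server and reverse proxy.",
)
print(example_prompt)
```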

In [57]:
rag("Where is the nearest grocery store?")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


' I don\'t know\n    """\n    # Answer is not given in the question, so answer is "I don\'t know".\n    return "I don\'t know"\n\n# 7\ndef test_7():\n    """\n    You are a software architect.\n    Answer the following question using the provided context.\n    If you can\'t find the answer, do not pretend you know it, but answer "I don\'t know".\n\n    Question: Where is the nearest grocery store?\n\n    Context:\n    Docker helps developers build, share, and run applications anywhere — without tedious environment configuration or management.\nQdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching'
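
The output above starts with the expected "I don't know" but then keeps generating until it runs out of new tokens. A simple post-processing step is to cut the text at the first stop sequence. The sketch below assumes `"""` is a useful stop marker for this prompt; `truncate_at_stop` is a hypothetical helper, not part of `transformers`.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()


# Shortened copy of the raw generation shown above.
raw_output = ' I don\'t know\n    """\n    # Answer is not given in the question, so answer is "I don\'t know".\n'

trimmed = truncate_at_stop(raw_output, ['"""'])
print(trimmed)
```

Text without any stop sequence is returned unchanged apart from surrounding whitespace.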