# IO RAG Boilerplate
**First copy this Colab to your own drive:** Click the "File" tab at the top left, and then click the "Save a copy in Drive" button.

To help you hit the ground running, I've developed a basic notebook that lets you play around with a simple RAG setup. This way you can focus on finding a nice dataset first, before you start focussing on the performance of the system or expanding the application.

This setup uses Hugging Face (HF) to download the models, Weaviate as a vector database, and LangChain to connect the elements and run queries.

**NOTE:** The first run of this notebook takes about 5 minutes to install all dependencies and download the models from HF, so you might want to hit "Runtime" -> "Run all" right now before you continue reading.

## Boilerplate setup
This notebook contains the following components:
* Installation, import and setup
* Download and embed data
* Download and quantize model
* Pipeline execution

Based on

https://github.com/tomasonjo/blogs/blob/master/weaviate/HubermanWeaviate.ipynb

# Installation, import and setup
You can largely ignore this code on your first run. Below, the dependencies are installed, most of them are imported, and the Weaviate database is set up. To keep the code clean, add any new packages you want to install and dependencies you want to import to this part of the notebook.


In [1]:
# installing additional packages
!pip install -q transformers peft accelerate bitsandbytes safetensors sentencepiece streamlit weaviate-client langchain sentence-transformers tiktoken youtube-transcript-api pypdf einops langchain_community langchain-openai #datasets auto-gptq optimum



In [2]:
# fixing unicode error in google colab
import locale
locale.getpreferredencoding = lambda: "UTF-8"

# import dependencies
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
    BitsAndBytesConfig,
)
from langchain.text_splitter import TokenTextSplitter
# HuggingFacePipeline was imported twice in the original cell; one import is enough
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Weaviate
from langchain_core.prompts import PromptTemplate

import weaviate

In [3]:
import weaviate
from weaviate.embedded import EmbeddedOptions

client = weaviate.Client(
  embedded_options=EmbeddedOptions()
)

Python client v3 `weaviate.Client(...)` connections and methods are deprecated and will
            be removed by 2024-11-30.

            Upgrade your code to use Python client v4 `weaviate.WeaviateClient` connections and methods.
                - For Python Client v4 usage, see: https://weaviate.io/developers/weaviate/client-libraries/python
                - For code migration, see: https://weaviate.io/developers/weaviate/client-libraries/python/v3_v4_migration

            If you have to use v3 code, install the v3 client and pin the v3 dependency in your requirements file: `weaviate-client>=3.26.7;<4.0.0`
  client = weaviate.Client(
INFO:weaviate-client:Binary /root/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.26.1/weaviate-v1.26.1-Linux-amd64.tar.gz
INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 1553


# Download and embed data
Below we choose and download an embedding model, download the data, vectorize it, and load it into the Weaviate database.


## Choose Embedding
As covered in the presentation, the embedding model is an important component of the RAG system. It basically turns words into vectors whose distances represent how similar the words are. This is what determines that the word King is closely related to the word Queen, and that these two in turn relate to each other in a similar way as the word Man and the word Woman. This means that the performance of this model determines how well the RAG system can find documents that are "similar" to the user query.
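
To make the King/Queen intuition concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are invented purely for illustration (think of the axes as "royalty", "male", "female"); a real embedding model produces vectors with hundreds of dimensions whose axes have no such clean meaning.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", NOT produced by any real model.
toy_vectors = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

# The classic analogy: king - man + woman should land near queen.
analogy = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

for word, vector in toy_vectors.items():
    print(f"{word:>6}: {cosine_similarity(analogy, vector):.3f}")
```

In this toy setup the analogy vector scores highest against "queen". Retrieval in a RAG system relies on the same idea, only with whole text chunks instead of single words.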

By changing the embedding_model_name below you instruct HF to download a different embedding model. The default is a well-performing, relatively small model, so it should be fine. If you want to optimize the system by changing the embedding model, have a look at the HF leaderboard via the link below. Be wary of the model size: the best-performing models are 60x larger than the default I set, which means they take long to download and exceed the available GPU memory on Google Colab.

https://huggingface.co/spaces/mteb/leaderboard


In [4]:

# specify embedding model (using huggingface sentence transformer)
embedding_model_name = "BAAI/bge-base-en-v1.5"

# arguments that are used to configure the model, these will be different for most models.
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_name, model_kwargs=model_kwargs
)

  embeddings = HuggingFaceEmbeddings(
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Download and embed data
Below we will download a dataset, decide on a chunking strategy, use the embedding model to embed the documents, and load them into Weaviate. I provide examples of different types of datasets, which you can toggle between with the parameter below:

In [5]:
'''
This parameter can be one of the following options
pdf
youtube
'''
rag_dataset = 'youtube'

### PDF IPCC report on climate change
In this dataset example we will download the full IPCC report on climate change, which is a 200-page document on the effects of climate change. While a document like that is hard to read through, it is a very valuable collection of knowledge. A RAG setup allows you to "ask questions to the document", so you can ask the specific climate-related questions you are interested in.

In [6]:
if rag_dataset == 'pdf':
    from langchain.document_loaders import PyPDFLoader

    # IPCC 2023 Climate change report
    pdf_url = 'https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_FullVolume.pdf'

    loader = PyPDFLoader(pdf_url)
    pages = loader.load()

    # We chunk the document into 1028-token chunks without overlap to load into Weaviate
    text_splitter = TokenTextSplitter(chunk_size=1028, chunk_overlap=0)
    split_docs = text_splitter.split_documents(pages)



### YouTube Captions Techlinked
For this dataset we will retrieve the 10 latest videos of a YouTube channel and then load the captions of those videos. For this example we will use the videos of Techlinked, a YouTube channel that covers the latest tech news 3 times a week, so this will give us approximately three weeks of tech news to explore.
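
The link extraction used for this dataset can be tried offline first. The sketch below parses a made-up miniature Atom feed; the feed text, `CHANNEL_ID`, `VIDEO_A` and `VIDEO_B` are placeholders, and the assumption is that the real feed has the same shape, with one channel-level `alternate` link followed by one per video entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed; IDs and titles are placeholders.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID"/>
  <link rel="alternate" href="https://www.youtube.com/channel/CHANNEL_ID"/>
  <entry>
    <title>Video A</title>
    <link rel="alternate" href="https://www.youtube.com/watch?v=VIDEO_A"/>
  </entry>
  <entry>
    <title>Video B</title>
    <link rel="alternate" href="https://www.youtube.com/watch?v=VIDEO_B"/>
  </entry>
</feed>"""

NAMESPACES = {"atom": "http://www.w3.org/2005/Atom"}


def extract_all_alternate_links(xml_text):
    """Collect every rel='alternate' link, including the channel-level one."""
    root = ET.fromstring(xml_text)
    return [
        link.get("href")
        for link in root.findall(".//atom:link[@rel='alternate']", NAMESPACES)
    ]


def extract_video_links(xml_text):
    """Collect only the links that sit inside an <entry>, i.e. the videos."""
    root = ET.fromstring(xml_text)
    return [
        link.get("href")
        for link in root.findall("./atom:entry/atom:link[@rel='alternate']", NAMESPACES)
    ]


print(extract_all_alternate_links(SAMPLE_FEED))
print(extract_video_links(SAMPLE_FEED))
```

Searching every `alternate` link and dropping the first one with `[1:]` gives the same result on this sample as restricting the search to `<entry>` elements; the second form does not depend on the channel link coming first.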


In [7]:
if rag_dataset == 'youtube':
    import requests
    import xml.etree.ElementTree as ET

    # Techlinked channel RSS feed
    URL = "https://www.youtube.com/feeds/videos.xml?channel_id=UCeeFfhMcJa1kjtfZAGskOCA"

    response = requests.get(URL)
    xml_data = response.content

    # Parse the XML data
    root = ET.fromstring(xml_data)

    # Define the namespace
    namespaces = {
        "atom": "http://www.w3.org/2005/Atom",
        "media": "http://search.yahoo.com/mrss/",
    }

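    # Extract YouTube links. The first "alternate" link in the feed points to
    # the channel itself rather than to a video, hence the [1:] slice.
    youtube_links = [
        link.get("href")
        for link in root.findall(".//atom:link[@rel='alternate']", namespaces)
    ][1:]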

    # Download and split the captions of the collected youtube videos
    from langchain.document_loaders import YoutubeLoader

    all_docs = []
    for link in youtube_links:
        loader = YoutubeLoader.from_youtube_url(link)
        docs = loader.load()
        all_docs.extend(docs)
    text_splitter = TokenTextSplitter(chunk_size=128, chunk_overlap=0)
    split_docs = text_splitter.split_documents(all_docs)
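
Both dataset examples rely on fixed-size chunking controlled by `chunk_size` and `chunk_overlap`. The sketch below shows the idea in plain Python. It treats whitespace-separated words as "tokens", which is a simplification: `TokenTextSplitter` counts tokens from a real tokenizer, so its chunk boundaries will differ. `chunk_tokens` is a hypothetical helper, not part of LangChain.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a list of tokens into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "the quick brown fox jumps over the lazy dog again".split()

for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Smaller chunks (like the 128 tokens used for captions) give more precise retrieval hits but less context per hit; a non-zero overlap reduces the chance that a sentence is cut in half at a chunk boundary.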



## Embedding and storing the documents
Now simply provide the split documents and the loaded embedding model to Weaviate to populate the vector database.

In [8]:
vector_db = Weaviate.from_documents(
    split_docs, embeddings, client=client, by_text=False
)

### Testing document retrieval
You can now quickly test the retrieval accuracy by providing a query and asking for the top 3 most similar documents in the database.
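
Conceptually, a top-k similarity search embeds the query, scores every stored chunk against it, and returns the k best matches. The following is a minimal brute-force sketch of that idea, not how Weaviate works internally: the `embed` function is a toy word-count stand-in for the embedding model, and the three documents are short paraphrases made up for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real model returns a dense vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_search(query, documents, k=3):
    """Return the k documents most similar to the query, best first."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


documents = [
    "compostable plastic made from bamboo keeps the ocean cleaner",
    "a new hard drive alternative stores data in crystals",
    "robo taxis roam the streets without human supervision",
]

print(similarity_search("cleaner ocean plastic", documents, k=1))
```

Note that the search always returns k results, even when nothing in the database is actually relevant to the query. That is worth keeping in mind when judging the retrieved documents.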


In [9]:

vector_db.similarity_search(
    "What's the most promising new technology to preserve marine biodiversity", k=3)

[Document(metadata={'source': 'z-04C2SxZsY'}, page_content=" news to me a fish person with multiple [Music] partners that's a that's a polymer person polymer person by vaporizing certain unfortunate polymers they can be reduced to their building blocks to make new Plastics now does this solve the whole microplastics inside of us and also the ocean and in our brains issue probably not given that we're vaporizing them but Chinese researchers are having success making a robust yet compostable hard plastic out of bamboo and that could help keep the ocean cleaner saving many non- monogamous fish people and mid Journey the company behind AI image generator mid journey is getting into Hardware but they haven't been very forthcoming about"),
 Document(metadata={'source': 'i-fung6kstw'}, page_content=" and extensive radiation exposure finally a hard drive alternative that can take a bullet I keep losing all my hard drives that way these crystals could last billions of years possibly long enough

# Integrating the LLM
Now this is where the magic happens: here you will choose an LLM, download it from HF, and quantize it so that it fits on the Google Colab GPU. By changing the model_name variable you can specify the model that needs to be downloaded. You can look at the top-performing open LLMs on the HF leaderboard, but again, note that the top performers are very large models that will not fit on the Google Colab GPU. Even the smaller models will not fit straight away and require quantization to run in Google Colab.

7 billion parameters is somewhat of a standard for "small" Large Language Models. Most open model families have a 7B version which will fit in the GPU once quantized.
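
A rough back-of-the-envelope calculation shows why quantization matters. The sketch below estimates the memory needed for the model weights alone at different precisions. It ignores activations, the generation cache and quantization metadata, so real usage is higher, and `estimate_weight_memory_gb` is a hypothetical helper written for this example.

```python
def estimate_weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # nominal size of a "7B" model

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    size_gb = estimate_weight_memory_gb(N_PARAMS, bits)
    print(f"{label:>8}: {size_gb:.1f} GB")
```

Assuming the free Colab GPU offers roughly 15 GB of memory (an assumption, check your own runtime), the 16-bit weights of a 7B model leave almost no room for anything else, while the 4-bit version fits comfortably.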

As a default I chose the 7B model of the Qwen2 model family, whose larger versions now rank among the top open LLMs. This model family is created by the Alibaba Group, though you are free to choose a more Western-aligned model like the French Mistral 7B. Note that the configuration parameters below can differ per model; however, these are quite standard and should work with most models.


HF Open LLM leaderboard:

https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard


## Tips
* **Clearing GPU memory:** In an interactive notebook like this, all objects are kept in memory, meaning that if you run the code below a couple of times your GPU memory will quickly fill up with model objects. This is quickly solved by restarting the Python session with the button under the Runtime tab. This clears all Python objects from memory but does not affect changes made on disk, so the installed packages and cached HF models should still be there.
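
If a full restart feels too heavy-handed, a lighter alternative (not what the tip above describes) is to drop your references to the model and ask PyTorch to release its cached GPU memory. This is a minimal sketch, assuming PyTorch is the framework holding the memory; `free_gpu_memory` is a hypothetical helper, and it can only free memory that no Python object refers to anymore.

```python
import gc


def free_gpu_memory():
    """Run garbage collection and release PyTorch's cached GPU memory.

    Returns True if a CUDA cache was cleared, False otherwise.
    """
    # Collect Python objects that are no longer referenced.
    gc.collect()
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no CUDA cache to clear.
        return False
    if torch.cuda.is_available():
        # Hand cached, unused GPU blocks back to the driver.
        torch.cuda.empty_cache()
        return True
    return False


# Usage: first delete the large objects (for example `del model`),
# then call the helper.
print(free_gpu_memory())
```

If the memory is still full afterwards, some object is most likely still holding on to the model, and restarting the session remains the most reliable fix.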

In [10]:
from transformers import pipeline

# specify the Hugging Face model name
#model_name = "cognitivecomputations/dolphin-2.6-mistral-7b"
model_name = "Qwen/Qwen2-7B-Instruct"


# function for loading 4-bit quantized model
def load_quantized_model(model_name: str):
    """
    :param model_name: Name or path of the model to be loaded.
    :return: Loaded quantized model.
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config
    )
    return model

# function for initializing tokenizer
def initialize_tokenizer(model_name: str):
    """
    Initialize the tokenizer with the specified model_name.

    :param model_name: Name or path of the model for tokenizer initialization.
    :return: Initialized tokenizer.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name, return_token_type_ids=False)
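    # Set beginning-of-sentence token id. Id 1 is the Llama/Mistral convention;
    # check your model's tokenizer before relying on it for other model families.
    tokenizer.bos_token_id = 1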
    return tokenizer

# initialize tokenizer
tokenizer = initialize_tokenizer(model_name)
# load model
model = load_quantized_model(model_name)
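# specify stop token ids (currently not passed to the pipeline below)
stop_token_ids = [0]

# build huggingface pipeline
# (stored under its own name so the imported `pipeline` function is not
# overwritten and this cell can be re-run)
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    use_cache=True,
    max_length=2048,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

# specify the llm
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)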



# OpenAI Integration

In [None]:
# import getpass
# import os

# if "OPENAI_API_KEY" not in os.environ:
#     os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [None]:
# from langchain_openai import OpenAI

# llm = OpenAI(base_url='https://iogpt-api-management-service.azure-api.net/openai/api/proxy/openai')

# Start Querying
Now with everything set up and loaded, we can start querying our documents. First we set up a QA chain in LangChain combining the defined LLM, the vector store and the retrieval strategy. Do you want to know which retrieval strategies are available out of the box? Ask your favourite AI chat ;)

Then we define the pre-prompt and add it to a prompt template. Note that we only use the prompt template to format the first question in the chain, not every question that comes thereafter.
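
It helps to see what a "stuff" chain does with your question: it pastes the retrieved chunks into one prompt ahead of the question. Below is a minimal sketch of that behaviour in plain Python. The instruction sentence is copied from this notebook's outputs; the `Question:` / `Helpful Answer:` tail is an assumption about the default prompt, and `build_stuff_prompt` and `extract_answer` are hypothetical helpers, not LangChain functions.

```python
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(question, retrieved_chunks):
    """Join all retrieved chunks and place them in front of the question."""
    context = "\n\n".join(retrieved_chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return only the text after the last marker, if the marker is present."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


chunks = ["first retrieved chunk", "second retrieved chunk"]
prompt = build_stuff_prompt("What is this about?", chunks)
print(prompt)

# Simulate a model that echoes the prompt before its answer.
print(extract_answer(prompt + " It is about two chunks."))
```

The printed responses in this notebook begin with the instruction text and the retrieved chunks, which suggests the prompt is echoed back together with the answer. A helper like `extract_answer` is one way to keep only the part after the marker.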

In [11]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vector_db.as_retriever()
)

# Define the pre-prompt template
# (written for the 'pdf' IPCC dataset; adapt the wording if rag_dataset = 'youtube')
template = """You are a helpful AI agent that is an expert on climate change.
  You will use the information from IPCC to answer the user's questions on climate change: {query}"""
prompt = PromptTemplate(template=template)


In [12]:

# The user's question
question = "What's the best way to save the climate?"

# Use the prompt template to format the question
formatted_prompt = prompt.format(query=question)
response = qa_chain.run(formatted_prompt)
print(response)

  response = qa_chain.run(formatted_prompt)
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

 news to me a fish person with multiple [Music] partners that's a that's a polymer person polymer person by vaporizing certain unfortunate polymers they can be reduced to their building blocks to make new Plastics now does this solve the whole microplastics inside of us and also the ocean and in our brains issue probably not given that we're vaporizing them but Chinese researchers are having success making a robust yet compostable hard plastic out of bamboo and that could help keep the ocean cleaner saving many non- monogamous fish people and mid Journey the company behind AI image generator mid journey is getting into Hardware but they haven't been very forthcoming about

 peak into the trough of disillusionment but it would make more sense if open AI is considering removing its cap on investor profits as reported by the fi

In [13]:
response = qa_chain.run("What's the most promising new technology to preserve marine biodiversity")
print(response)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

 news to me a fish person with multiple [Music] partners that's a that's a polymer person polymer person by vaporizing certain unfortunate polymers they can be reduced to their building blocks to make new Plastics now does this solve the whole microplastics inside of us and also the ocean and in our brains issue probably not given that we're vaporizing them but Chinese researchers are having success making a robust yet compostable hard plastic out of bamboo and that could help keep the ocean cleaner saving many non- monogamous fish people and mid Journey the company behind AI image generator mid journey is getting into Hardware but they haven't been very forthcoming about

 and extensive radiation exposure finally a hard drive alternative that can take a bullet I keep losing all my hard drives that way these crystals could l

In [15]:
question = input("What's your question: ")
response = qa_chain.run(question)
print(response)

What's your question: How many cars should we have in order to make world dark by co2 in the air?
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

 thanks to a partnership with whmo so don't be too surprised if you hail a ride sometime in early 2025 and an empty car shows up app users can apparently increase the likelihood of getting chauffered by a ghost by opting into autonomous rides but it's not clear if that includes the option to be completely opted out can you request a Victorian ghost while there have been concerns of Robo taxis roaming the streets without human supervision weo has been comparatively open about its crash statistics which not only show that wayo taxis get into severe crashes around a third as much as human drivers but also that most of those crashes involved getting rear ended by Flesh and Blood

 Etc the ban also covers automated Driving Systems wh

In [14]:
question = input("What's your question: ")
response = qa_chain.run(question)
print(response)

What's your question: Who is Khameneie?
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

I wanted to surprise you with a little gift today okay in this episode I'm going to list all my favorite things about you number one I guess we'll do Tech news telegram CEO pav durov was arrested last weekend on suspicion of failing to moderate criminal activity on the messaging app but on Wednesday that suspicion was upgraded to preliminary charges as jov was released on bail and barred from leaving France pending further investigation the moved seemed sudden but it makes some sense given that back in March jov told the financial times he doesn't think they should be policing the way people Express themselves unless they cross red lines which red lines unclear although it seems like

 be the development of an AI personality you know something that will make their rumored iPad on a rob

In [None]:
question = input("What's your question: ")
response = qa_chain.run(question)
print(response)