## Trying RAG From MD Files

In this notebook I try to extract context for summarisation from a series of .md files. The idea is to build a working chatbot, initially on Streameye's blog articles and later on the Streameye documentation once it is ready.

Challenges are:

- Vectorising the md files and storing them in a vector DB
- Extracting proper context for user queries
- Using an LLM to answer these queries from the stored data
- Avoiding LLM hallucinations and keeping the answers concise and grounded in the supplied context

#### Resources

Great resource by Venelin Valkov [here](https://www.mlexpert.io/prompt-engineering/langchain-quickstart-with-llama-2)

YouTube tutorial and resource that explains the process with OpenAI's embeddings instead of HF [here](https://www.youtube.com/watch?v=tcqEUSNCn8I)

### Preprocessing

#### Loading the md files and tokenising

It's a good idea to split big documents or long texts into chunks. Smaller chunks load faster, and each chunk is more focused and relevant to a query.
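
To make the idea of overlapping chunks concrete, here is a minimal sketch of a fixed-window splitter. It is not the algorithm that `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences); the helper name `chunk_text` and the sample text are made up for illustration. It only shows what `chunk_size=500` and `chunk_overlap=100` mean in terms of characters.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Made-up sample: 1200 characters of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
print(pieces[0][-100:] == pieces[1][:100])  # neighbouring chunks share 100 characters
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one of the two neighbouring chunks, at the cost of storing some text twice.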


In [1]:
# Installs - need to restart notebook after installs!

!pip install unstructured
!pip install sentence-transformers
!pip install chromadb

Collecting unstructured
  Downloading unstructured-0.12.2-py3-none-any.whl.metadata (26 kB)
Collecting chardet (from unstructured)
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting filetype (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting python-magic (from unstructured)
  Downloading python_magic-0.4.27-py2.py3-none-any.whl (13 kB)
Collecting lxml (from unstructured)
  Downloading lxml-5.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.5 kB)
Collecting nltk (from unstructured)
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting tabulate (from unstructured)
  Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Collecting emoji (from unstructured)
  Downloading emoji-2.10.0-py2.py3-none-any.whl.metadata (5.3 kB)
Collecting python-iso639 (from unstru

In [2]:
!pip install "unstructured[md]"

Collecting markdown (from unstructured[md])
  Downloading Markdown-3.5.2-py3-none-any.whl.metadata (7.0 kB)
Downloading Markdown-3.5.2-py3-none-any.whl (103 kB)
Installing collected packages: markdown
Successfully installed markdown-3.5.2


In [3]:
# Includes
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.vectorstores.chroma import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.evaluation import load_evaluator
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.chains import RetrievalQA
from langchain import HuggingFacePipeline
from langchain.memory import ConversationBufferMemory

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

import os
import shutil

In [4]:
DATA_PATH = "./data/streameye_blog/"
CHROMA_PATH = "./data/streameye_blog/db"

To split text into chunks we use the **RecursiveCharacterTextSplitter**

In [5]:
def load_documents():
    loader = DirectoryLoader(DATA_PATH, glob="*.md")
    documents = loader.load()
    return documents


def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")
    return chunks


In [8]:
def save_to_chroma(chunks: list[Document]):
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH) ## Recursive delete
    db = Chroma.from_documents(chunks, HuggingFaceEmbeddings(), persist_directory=CHROMA_PATH)
    db.persist()

In [5]:
docs = load_documents()

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [6]:
chunks = split_text(docs)

Split 37 documents into 569 chunks.


In [9]:
save_to_chroma(chunks)

.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

train_script.py:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

Running with the default HF embeddings gives pretty bad results. Let's try a better model, starting with a relatively small one, as I don't know whether the RAM is sufficient.

In [11]:
hf_evaluator = load_evaluator("pairwise_embedding_distance", embeddings=HuggingFaceEmbeddings())
hf_evaluator.evaluate_string_pairs(prediction="apple", prediction_b="orange")

{'score': 0.5988499147809145}

I think the above result is not good: OpenAI gives a vector distance of 0.1349. Let's compare with an LLM.

Here is a source model for generating embeddings: [hf embeddings](https://huggingface.co/thenlper/gte-large)

A regular LLM like microsoft/phi-2 is not fine-tuned to supply embeddings. Its tokenizer also has no default padding token id, which results in errors.

In [5]:
# model_name = "microsoft/phi-2"
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

In [6]:
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Using the `thenlper/gte-large` embedding model:

In [7]:
gte_large = HuggingFaceEmbeddings(model_name="./data/embeddings/gte-large/", 
                                       model_kwargs={"device": "cuda"}, 
                                       encode_kwargs={"normalize_embeddings": True})

In [12]:
hf_evaluator = load_evaluator("pairwise_embedding_distance", embeddings=gte_large)
hf_evaluator.evaluate_string_pairs(prediction="apple", prediction_b="iphone")

{'score': 0.08266134804374758}

This gives a pretty good distance score. The output above is from a run comparing **apple** and **iphone** (**0.083**); for **apple** and **orange** the calculated distance was **0.179**!

Remember that OpenAI gives a score of **0.135**, while the default HuggingFaceEmbeddings gives a score of **0.599**.
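
To see what these scores measure, here is a minimal sketch of cosine distance, which I assume is the metric behind the `pairwise_embedding_distance` evaluator (as far as I know it is the default). The two-dimensional vectors are made-up toy values, not real embeddings.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors (assumptions for illustration only)
v_fruit = [1.0, 0.0]
v_similar = [1.0, 1.0]
v_unrelated = [0.0, 1.0]

print(cosine_distance(v_fruit, v_fruit))      # identical vectors
print(cosine_distance(v_fruit, v_similar))    # 45 degrees apart
print(cosine_distance(v_fruit, v_unrelated))  # orthogonal vectors
```

A lower score therefore means the model places the two strings closer together, which is why 0.179 is read as better than 0.599 for a related pair of words.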

Now that we know this, we will rewrite `save_to_chroma()` to take another parameter: the function used to create the embeddings, as shown below.

### Saving the DB to persist on Disk

In [7]:
def save_to_chroma_with_embeddings(chunks: list[Document], embeddings):
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH) ## Recursive delete
    db = Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_PATH)
    db.persist()

In [8]:
docs = load_documents()
chunks = split_text(docs)
save_to_chroma_with_embeddings(chunks, gte_large)

Split 37 documents into 569 chunks.


### Loading the Database

We need to use the same embedding function that we used to create the DB, and the path to the DB, like below:

In [8]:
db = Chroma(persist_directory=CHROMA_PATH, embedding_function=gte_large)

In [18]:
results = db.similarity_search_with_relevance_scores("What can sports betting ads be used for", k=2)
results[1][0].page_content

'Ads that highlight the social aspect of sports betting'

The result of this operation is the relevant chunks from the vectorised documents, which we can pass as context to the LLM so that it can give a direct answer to the question.
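
Conceptually, a similarity search embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below does this with a plain linear scan over made-up three-dimensional vectors and placeholder chunk texts; a real vector store such as Chroma relies on an index rather than scanning every vector, so treat this only as an illustration of the idea.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec: list[float], indexed_chunks: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (text, score) pairs with the highest cosine similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical index: placeholder texts with toy vectors
toy_index = [
    ("chunk about betting ads", [0.9, 0.1, 0.0]),
    ("chunk about HTML5 banners", [0.1, 0.9, 0.0]),
    ("chunk about animated skins", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded user question

hits = top_k(query_vec, toy_index, k=2)
for text, score in hits:
    print(f"{score:.3f}  {text}")
```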

First, let's create the template:

In [1]:
PROMPT_TEMPLATE = """
<s>[INST]Answer the question based only on the following context and the chat history:

Context: {context}

History: {history}

Question: {question}[/INST]
"""
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["context", "question", "history"])
# context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])

NameError: name 'PromptTemplate' is not defined

### Creating the RetrievalQA Chain

I am not sure how this passes in the question and the context. We don't seem to specify these anywhere.
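
As far as I understand it, the chain fills these in itself: `RetrievalQA` runs the retriever on the query, the "stuff" chain joins the retrieved chunks into one string and places it in the variable named by `document_variable_name` (`context` in the chain's printed representation further down), and the query is passed as `question`. The sketch below imitates that with plain string formatting; `stuff_prompt`, the shortened template and the chunk texts are made up, and the `"\n\n"` separator is an assumption about the default.

```python
TEMPLATE = "Context: {context}\n\nQuestion: {question}"


def stuff_prompt(template: str, chunks: list[str], question: str, separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Placeholder chunks standing in for the retriever's results
filled = stuff_prompt(TEMPLATE, ["first chunk", "second chunk"], "What are animated skins?")
print(filled)
```

This is why the template only needs the placeholder names to match: nothing else has to be wired up by hand.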

The RetrievalQA API expects a pipeline, however, not the actual model. So we wrap the model in a HuggingFace pipeline with a generation configuration and pass the resulting pipeline object to RetrievalQA.

In [10]:
generation_config = GenerationConfig.from_pretrained(model_name)
generation_config.max_new_tokens = 512
generation_config.temperature = 0.01
generation_config.top_p = 0.95
generation_config.do_sample = True
generation_config.repetition_penalty = 1.15

In [11]:
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    generation_config=generation_config,
)
llm = HuggingFacePipeline(pipeline=text_pipeline)

In [35]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt, "memory": ConversationBufferMemory(
            memory_key="history",
            input_key="question")},
)


In [37]:
qa_chain.dict

<bound method Chain.dict of RetrievalQA(combine_documents_chain=StuffDocumentsChain(memory=ConversationBufferMemory(input_key='question'), llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['context', 'history', 'question'], template='\n<s>[INST]Answer the question based only on the following context and the chat history:\n\nContext: {context}\n\nHistory: {history}\n\nQuestion: {question}[/INST]\n'), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7f38dcc19750>)), document_variable_name='context'), return_source_documents=True, retriever=VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7f39100ae4d0>, search_kwargs={'k': 3}))>

In [19]:
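# Run the chain first; `result` holds the query, the answer and the source documents.
result = qa_chain("Could you give me some brief explanation about animated skins? I saw an article about them.")
sources = "\n".join([doc.metadata["source"] for doc in result["source_documents"]])
result["sources"] = sources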

In [20]:
# sources = [doc["metadata"]["source"] for doc in result["source_documents"]]
print("My question:\n{query}\n\nAnswer:\n{result}\n\nSources:\n{sources}".format(**result))

My question:
Could you give me some brief explanation about animated skins? I saw an article about them.

Answer:
Animated skins are a type of advertising format where a background image or design comes to life with motion and animation. They function as a non-intrusive, visually appealing backdrop for ads, allowing them to stand out on websites and apps without disrupting the user experience. Animated skins are fully responsive, adapting to various screen sizes and devices, making them suitable for multi-platform campaigns. Their use has become popular among advertisers seeking to create more engaging and memorable ads. While takeover skins have been in existence for some time, there's been a recent surge in interest due to their effectiveness in capturing audience attention.

Sources:
data/streameye_blog/How_Animated_Skins_Will_Revolutionize_Big_Ad_Campaigns.md
data/streameye_blog/How_Animated_Skins_Will_Revolutionize_Big_Ad_Campaigns.md
data/streameye_blog/How_Animated_Skins_Will_Re
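
The same file is listed several times because every retrieved chunk carries its own `source` metadata. A small sketch for de-duplicating the list while keeping the retrieval order; the paths are placeholders and `unique_sources` is a hypothetical helper name.

```python
def unique_sources(paths: list[str]) -> list[str]:
    """Drop repeated paths while preserving first-seen order."""
    return list(dict.fromkeys(paths))


# Placeholder paths standing in for doc.metadata["source"] values
retrieved = ["blog/a.md", "blog/a.md", "blog/b.md", "blog/a.md"]
print("\n".join(unique_sources(retrieved)))
```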

#### Implement History Later

There seems to be a solution [here](https://stackoverflow.com/questions/76240871/how-do-i-add-memory-to-retrievalqa-from-chain-type-or-how-do-i-add-a-custom-pr).

It is implemented above.
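
What a buffer memory does can be sketched in a few lines: it stores each question and answer and renders them as one string that fills the `{history}` placeholder. This is not the code of `ConversationBufferMemory`; the class name `HistoryBuffer`, the `Human:`/`AI:` prefixes and the sample turn are assumptions made for illustration.

```python
class HistoryBuffer:
    """Hypothetical minimal conversation memory: stores turns, renders them as text."""

    def __init__(self) -> None:
        self.turns: list[tuple[str, str]] = []

    def add_turn(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def render(self) -> str:
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)


buffer = HistoryBuffer()
buffer.add_turn("What are takeover skins?", "Background wallpapers used to display ads.")

# The rendered history fills the {history} slot of the template
filled = "History: {history}\n\nQuestion: {question}".format(
    history=buffer.render(), question="Are they animated?"
)
print(filled)
```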

### Checking out Gradio

Try and use Gradio to take in some prompt and output some basic formatted output.

In [39]:
!pip install gradio

Collecting gradio
  Downloading gradio-4.15.0-py3-none-any.whl.metadata (15 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting altair<6.0,>=4.2.0 (from gradio)
  Downloading altair-5.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.1.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client==0.8.1 (from gradio)
  Downloading gradio_client-0.8.1-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx (from gradio)
  Downloading httpx-0.26.0-py3-none-any.whl.metadata (7.6 kB)
Collecting orjson~=3.0 (from gradio)
  Downloading orjson-3.9.12-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (49 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Col

In [3]:
import gradio as gr

def chatter(message, history):
    result = qa_chain(message)
    sources = "\n".join([doc.metadata["source"] for doc in result["source_documents"]])
    return f"{result['result']}\n\nSources:\n{sources}"


In [None]:
interface = gr.ChatInterface(chatter).launch()

#### Try gradio with threads

In [42]:
from threading import Thread
import gradio as gr

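class StatefulThread(Thread):
    """Thread that stores the return value of its target."""
    def __init__(self, target, args):
        # Accept a bare argument as well as a tuple: `(message)` is just `message`.
        if not isinstance(args, tuple):
            args = (args,)
        super().__init__(target=target, args=args)
        self._result = None
    def run(self):
        # Unpack the arguments instead of passing the whole tuple as one argument.
        self._result = self._target(*self._args)
    def get_result(self):
        return self._result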


    
def chatter(message, history):
    PROMPT_TEMPLATE = """
    <s>[INST]Answer the question based only on the following context and the chat history:
    
    Context: {context}
    
    History: {history}
    
    Question: {question}[/INST]
    """
    history_transformer_format = history + [[message, ""]]
    history_str = "".join(["".join(["\n<human>:"+item[0], "\n<bot>:"+item[1]])  #curr_system_message +
                for item in history_transformer_format])
    no_history_prompt = PROMPT_TEMPLATE.replace("{history}", history_str)
    prompt = PromptTemplate(template=no_history_prompt, input_variables=["context", "question"])
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=db.as_retriever(search_kwargs={"k": 3}),
        return_source_documents=True,
        verbose=True,
        chain_type_kwargs={"prompt": prompt, "verbose": True}
    )
    t = StatefulThread(target=qa_chain, args=(message))
    t.start()
    t.join()
    result = t.get_result()
    sources = "\n".join([doc.metadata["source"] for doc in result["source_documents"]])
    return f"{result['result']}\n\nSources:\n{sources}"

interface = gr.ChatInterface(chatter).launch(share=True)

Running on local URL:  http://127.0.0.1:7862
Running on public URL: https://6fdc30d7cba913e4a8.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
    <s>[INST]Answer the question based only on the following context and the chat history:
    
    Context: Animated skins are increasingly being used by advertisers to make their online campaigns more visually appealing and interactive. These skins are essentially background wallpapers that can be used to display ads without interfering with the site’s original content.

While takeover skins have been around for years, they have recently gained attention due to their effectiveness. In fact, many advertisers are turning to takeover skins to make their ads more engaging and memorable.

Isn’t that just a takeover skin?

Takeover skins are generally static, but it is possible to create dynamic ones with tools like StreamEye. They are generally created to take advantage of full HD resolution (1920x1080) and ar

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
    <s>[INST]Answer the question based only on the following context and the chat history:
    
    Context: and even sound. And do not worry at all, because HTML banners are compatible with all devices and screen sizes, which means that you can retarget your potential clients on both mobile and desktop.

Have you noticed how attached are people to their phones? Well, the digital marketing industry has not missed it, either. Mobile advertising is on the rise in the last few years and experts believe it could reach a revenue of $384.9 billion by 2023. If you do not want to be left behind, HTML5 banners are the way to go. They are compatible with different devices and screen sizes and are easy to optimize according to your needs. In other words, they will function flawlessly regardless of where

Have all the 

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
    <s>[INST]Answer the question based only on the following context and the chat history:
    
    Context: 3. HTML5 banners are the next iteration of animated online ads. Similar to Flash, they provide interactive and feature-rich options but without the need for extra software or compatibility concerns across different browsers and devices. HTML5 banners have long replaced Flash and are soon to do the same with animated GIFs. They are smaller in size and at the same time you can embed Rich Media features thanks to the fact that HTML5 ads also use CSS and Javascript. This means that you can add

1. Animated GIF banners came after the static ads. GIF stands for Graphic Interchange Format. GIF banners have the .gif extension and can be either static or animated. Because of their nature, they are the middle 

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
    <s>[INST]Answer the question based only on the following context and the chat history:
    
    Context: Find out how you can create your own HTML5 banner campaign on Streameye’s platform.

The animated banner evolution: From animated GIFs to HTML5

At Streameye, we make creative work more flexible and agile with a platform that requires less effort from your team. And yes, animated ads production is included and very cost-effective, too. It allows you to become more responsive to market changes and your competitors' actions. With Streameye, you can tweak creatives, update messaging, and change your offers in a matter of minutes, without the help of a developer or even a senior designer. Your custom templates and vast visual library mean

At Streameye, we make creative work more flexible and agile with 

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
    <s>[INST]Answer the question based only on the following context and the chat history:
    
    Context: Of course, having all of these professionals on your side is crucial for the success of the campaign that eventually you are going to launch. However, maintaining effective communication and workflow can sometimes become challenging and lead to a bit of chaos.

support that will elevate your creative abilities even further. Streameye doesn't just bring you creative templates fit for your brand. Our team will be there to help you get the most out of the platform from the start and stay up to date with the latest digital creative technology. We'll proactively bring you ideas to improve your demand generation results and make use of new creative opportunities. And as much as it’s good if the creative ma
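
As an alternative to the hand-written `StatefulThread`, the standard library's `concurrent.futures` already returns the target's result from a worker thread. A minimal sketch, where `slow_answer` is a hypothetical stand-in for the chain call and returns a dict shaped like the chain's output:

```python
from concurrent.futures import ThreadPoolExecutor


def slow_answer(message: str) -> dict:
    """Hypothetical stand-in for the chain call."""
    return {"query": message, "result": message.upper()}


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_answer, "hi there")  # runs in a worker thread
    result = future.result()                       # blocks until the call finishes

print(result["result"])
```

`future.result()` also re-raises any exception from the worker, which a plain `Thread` subclass would swallow unless it stores the exception itself.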

In [13]:
history = []
message = "what are takeover skins"
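# NOTE: a plain string is not a valid prompt here; the chain expects a PromptTemplate,
# which is why this cell raises the ValidationError shown below.
prompt = "You are an AI assistant that answers questions:"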
qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=db.as_retriever(search_kwargs={"k": 3}),
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt, "history": history, "question": message}
    )
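class StatefulThread(Thread):
    """Thread that stores the return value of its target."""
    def __init__(self, target, args):
        # Accept a bare argument as well as a tuple: `("hi there")` is just a string.
        if not isinstance(args, tuple):
            args = (args,)
        super().__init__(target=target, args=args)
        self._result = None
    def run(self):
        # Unpack the arguments instead of passing the whole tuple as one argument.
        self._result = self._target(*self._args)
    def get_result(self):
        return self._result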

tt = StatefulThread(target=qa_chain, args=("hi there"))
tt.start()
tt.join()
result = tt.get_result()

ValidationError: 1 validation error for LLMChain
prompt
  value is not a valid dict (type=type_error.dict)

{'query': 'hi there',
 'result': "Hi there! Midjourney is an AI art generator that transforms text-based prompts into unique and impressive images. It's an independent tool that can be used to add visual interest to projects like HTML5 banner campaigns. One of its features is animated skins, which can bring banners to life by adding movement and dynamic elements. If you have any questions or would like to learn more, please don't hesitate to book a demo with us.",
 'source_documents': [Document(page_content='We tried out Midjourney.\n\nWe played around with it, unleashed our mutual creativity and, last but not least, we definitely enjoyed the whole process.\n\nWhat is Midjourney and why do I need to care about it?\n\nMidjourney is an independent AI art generator that turns text-based prompts into images. Only with a few words, Midjourney has the capacity to create unique and impressive artworks that will take your breath away.', metadata={'source': 'data/streameye_blog/We_Tried_Midjour