# About

- Use [LangChain](https://python.langchain.com/en/latest/index.html) to build a chatbot that can answer questions about [Harry Potter books](https://www.kaggle.com/datasets/hinepo/harry-potter-books-in-pdf-1-7)
- Experiment with various LLMs (Large Language Models)
- Use [FAISS vector store](https://python.langchain.com/docs/integrations/vectorstores/faiss) to store text embeddings created with [Sentence Transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) from [Hugging Face](https://huggingface.co/hkunlp/instructor-large). FAISS can run on GPU and is much faster than Chroma
- Use [Retrieval chain](https://python.langchain.com/docs/modules/data_connection/retrievers/) to retrieve relevant passages from embedded text
- Summarize retrieved passages
- Leverage Kaggle dual GPU (2 * T4) with [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/index)
- Chat UI with [Gradio](https://www.gradio.app/guides/quickstart)

### Models

- [WizardLM](https://huggingface.co/TheBloke/wizardLM-7B-HF)
- [Falcon](https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v2)
- [Llama 2-7b](https://huggingface.co/daryl149/llama-2-7b-chat-hf)
- [Llama 2-13b](https://huggingface.co/daryl149/llama-2-13b-chat-hf)
- [Bloom](https://huggingface.co/bigscience/bloom-7b1)

In [None]:
# Run this cell and select the kaggle.json file downloaded
# from the Kaggle account settings page.
from google.colab import files
files.upload()

KeyboardInterrupt: 

In [None]:
# Let's make sure the kaggle.json file is present.
!ls -lha kaggle.json

-rw-r--r-- 1 root root 64 Jan 21 07:13 kaggle.json


In [None]:
# Next, install the Kaggle API client.
!pip install -q kaggle

In [None]:
# The Kaggle API client expects this file to be in ~/.kaggle,
# so move it there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d hustzx/llama-2-13b-chat-hf

Downloading llama-2-13b-chat-hf.zip to /content
100% 37.3G/37.3G [19:54<00:00, 36.1MB/s]
100% 37.3G/37.3G [19:54<00:00, 33.6MB/s]


In [None]:
!unzip /content/llama-2-13b-chat-hf.zip -d /content/llama-2-13b-chat-hf

Archive:  /content/llama-2-13b-chat-hf.zip
  inflating: /content/llama-2-13b-chat-hf/LICENSE.txt  
  inflating: /content/llama-2-13b-chat-hf/README.md  
  inflating: /content/llama-2-13b-chat-hf/Responsible-Use-Guide.pdf  
  inflating: /content/llama-2-13b-chat-hf/USE_POLICY.md  
  inflating: /content/llama-2-13b-chat-hf/config.json  
  inflating: /content/llama-2-13b-chat-hf/generation_config.json  
  inflating: /content/llama-2-13b-chat-hf/model-00001-of-00003.safetensors  
  inflating: /content/llama-2-13b-chat-hf/model-00002-of-00003.safetensors  
  inflating: /content/llama-2-13b-chat-hf/model-00003-of-00003.safetensors  
  inflating: /content/llama-2-13b-chat-hf/model.safetensors.index.json  
  inflating: /content/llama-2-13b-chat-hf/pytorch_model-00001-of-00003.bin  
  inflating: /content/llama-2-13b-chat-hf/pytorch_model-00002-of-00003.bin  
  inflating: /content/llama-2-13b-chat-hf/pytorch_model-00003-of-00003.bin  
  inflating: /content/llama-2-13b-chat-hf/pytorch_model.bin.i

![image.png](attachment:cdc462a7-e241-4332-821a-fa369a853128.png)

img source: HinePo

In [None]:
! nvidia-smi -L

# Installs

This step takes
- ~16 s on Colab
- ~4 min on Kaggle

In [None]:
%%time

! pip install -qq -U langchain tiktoken pypdf faiss-gpu
! pip install -qq -U transformers InstructorEmbedding sentence_transformers
! pip install -qq -U accelerate bitsandbytes xformers einops

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.4.1 requires cupy-cuda11x<12.0.0a0,>=9.5.0, which is not installed.
cuml 23.4.1 requires cupy-cuda11x<12.0.0a0,>=9.5.0, which is not installed.
dask-cudf 23.4.1 requires cupy-cuda11x<12.0.0a0,>=9.5.0, which is not installed.
cudf 23.4.1 requires protobuf<4.22,>=4.21.6, but you have protobuf 3.20.3 which is incompatible.
cuml 23.4.1 requires dask==2023.3.2, but you have dask 2023.5.0 which is incompatible.
dask-cuda 23.4.0 requires dask==2023.3.2, but you have dask 2023.5.0 which is incompatible.
dask-cudf 23.4.1 requires dask==2023.3.2, but you have dask 2023.5.0 which is incompatible.
distributed 2023.3.2.1 requires dask==2023.3.2, but you have dask 2023.5.0 which is incompatible.
google-cloud-bigquery 2.34.4 requires packaging<22.0dev,>=14.3, but you have packaging 23.2 which is incompatible.
jupyte

# Imports

In [None]:
import warnings
warnings.filterwarnings("ignore")

import os
import glob
import textwrap
import time

import langchain

# loaders
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

# splits
from langchain.text_splitter import RecursiveCharacterTextSplitter

# prompts
from langchain import PromptTemplate, LLMChain

# vector stores
from langchain.vectorstores import FAISS

# models
from langchain.llms import HuggingFacePipeline
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings

# retrievers
from langchain.chains import RetrievalQA

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

print('LangChain:', langchain.__version__)

In [None]:
glob.glob('/kaggle/input/harry-potter-books-in-pdf-1-7/HP books/*')

# CFG

- CFG class enables easy and organized experimentation

In [None]:
class CFG:
    # LLMs
    model_name = 'llama2-13b' # wizardlm, bloom, falcon, llama2-7b, llama2-13b
    temperature = 0
    top_p = 0.95
    repetition_penalty = 1.15

    # splitting
    split_chunk_size = 800
    split_overlap = 0

    # embeddings
    embeddings_model_repo = 'sentence-transformers/all-MiniLM-L6-v2'

    # similar passages
    k = 3

    # paths
    PDFs_path = '/kaggle/input/harry-potter-books-in-pdf-1-7/HP books/'
    Embeddings_path =  '/kaggle/input/faiss-hp-sentence-transformers'
    Persist_directory = './harry-potter-vectordb'

# Define model

In [None]:
def get_model(model = CFG.model_name):

    print('\nDownloading model: ', model, '\n\n')

    if model == 'wizardlm':
        model_repo = 'TheBloke/wizardLM-7B-HF'

        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True
        )

        max_len = 1024

    elif model == 'llama2-7b':
        model_repo = 'daryl149/llama-2-7b-chat-hf'

        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        max_len = 2048

    elif model == 'llama2-13b':
#         model_repo = 'daryl149/llama-2-13b-chat-hf' # from Hugging Face
        model_repo = '/kaggle/input/llama-2-13b-chat-hf' # from kaggle dataset

        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,

#             bnb_4bit_quant_type='nf4',  # Normalized float 4
#             bnb_4bit_use_double_quant=True,  # Second quantization after the first
#             bnb_4bit_compute_dtype=bfloat16,  # Computation type

            device_map='auto',
            torch_dtype=torch.float16,
#             low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        max_len = 8192

    elif model == 'bloom':
        model_repo = 'bigscience/bloom-7b1'

        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
        )

        max_len = 1024

    elif model == 'falcon':
        model_repo = 'h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v2'

        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        max_len = 1024

    else:
        raise ValueError(f"Model not implemented (tokenizer and backbone): {model}")

    return tokenizer, model, max_len
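
To see why `load_in_4bit=True` helps on a pair of T4 GPUs, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It is only a sketch under stated assumptions: the parameter counts (7e9 and 13e9) are the nominal sizes implied by the model names, 16 GB is the nominal memory of one T4, and `weight_memory_gb` is a hypothetical helper. Activations, the KV cache, and quantization overhead are ignored, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


T4_MEMORY_GB = 16  # nominal memory of a single NVIDIA T4 (assumption stated above)

# Nominal parameter counts, taken from the model names
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    fits = "fits" if fp16 <= T4_MEMORY_GB else "does not fit"
    print(f"{name}: fp16 ~{fp16:.1f} GB ({fits} on one T4), 4-bit ~{int4:.1f} GB")
```

Under these assumptions, the 13B weights in float16 exceed a single T4, while the 4-bit weights are roughly a quarter of that size.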

This step takes
- ~3-9 min on Colab
- ~5-8 min on Kaggle, sometimes much more, up to 35 min

In [None]:
%%time

tokenizer, model, max_len = get_model(model = CFG.model_name)

In [None]:
model.eval()

In [None]:
### check how Accelerate split the model across the available devices (GPUs)
model.hf_device_map
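
The device map is a dictionary from module names to devices, so a quick way to read it is to count how many modules landed on each GPU. The sketch below uses a small made-up mapping (`example_device_map` is hypothetical and much shorter than a real one); with a loaded model you would pass `model.hf_device_map` instead.

```python
from collections import Counter

# Hypothetical device map, shaped like the dict exposed as `model.hf_device_map`
example_device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 1,
    "model.layers.3": 1,
    "model.norm": 1,
    "lm_head": 1,
}


def modules_per_device(device_map):
    """Count how many modules were placed on each device."""
    return dict(Counter(device_map.values()))


print(modules_per_device(example_device_map))
```

A heavily unbalanced count, or entries mapped to `"cpu"` or `"disk"`, would suggest the model did not fit entirely on the GPUs.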

# Pipeline

- Hugging Face pipeline

In [None]:
pipe = pipeline(
    task = "text-generation",
    model = model,
    tokenizer = tokenizer,
    pad_token_id = tokenizer.eos_token_id,
    max_length = max_len,
    temperature = CFG.temperature,
    top_p = CFG.top_p,
    repetition_penalty = CFG.repetition_penalty
)

llm = HuggingFacePipeline(pipeline = pipe)

In [None]:
llm

In [None]:
%%time
### testing model, not using the harry potter books yet
### answer is not necessarily related to harry potter
query = "Give me 5 examples of cool potions and explain what they do"
llm(query)

# Langchain

- Multiple document retriever with LangChain

In [None]:
CFG.model_name

# Loader

- [Directory loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) for multiple files

    - Load the PDF files
    - Split into chunks
    - Create embeddings
    - Save the embeddings in a vector store
    - After that, load the saved embeddings to run a similarity search with the user query, then use the LLM to answer the question

In [None]:
%%time

loader = DirectoryLoader(
    CFG.PDFs_path,
    glob="./*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
    use_multithreading=True
)

documents = loader.load()

In [None]:
print(f'We have {len(documents)} pages in total')

In [None]:
documents[8].page_content

# Splitter

- Split the text into chunks so that individual passages can be searched by similarity
- This step is only necessary if you are creating the embeddings
- [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/reference/modules/document_loaders.html?highlight=RecursiveCharacterTextSplitter#langchain.document_loaders.MWDumpLoader)

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CFG.split_chunk_size,
    chunk_overlap = CFG.split_overlap
)

texts = text_splitter.split_documents(documents)

print(f'We have created {len(texts)} chunks from {len(documents)} pages')
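
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified splitter that cuts text into fixed-size windows. It is a sketch only: `split_fixed` is a hypothetical helper, and it is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries to break on separators such as paragraphs and newlines before falling back to smaller units.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters, each sharing chunk_overlap characters with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters of toy text

no_overlap = split_fixed(sample, chunk_size=8, chunk_overlap=0)
with_overlap = split_fixed(sample, chunk_size=8, chunk_overlap=2)

print(len(no_overlap), no_overlap)
print(len(with_overlap), with_overlap)
```

With overlap, consecutive chunks repeat a few characters, which produces more chunks but reduces the chance that a relevant sentence is cut in half at a chunk boundary.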

# Create Embeddings


- Embed and store the texts in a vector database (FAISS)
- [LangChain Vector Stores docs](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
- [FAISS - langchain](https://python.langchain.com/docs/integrations/vectorstores/faiss)
- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks - paper Aug/2019](https://arxiv.org/pdf/1908.10084.pdf)
- [This is a nice 4-minute video about vector stores](https://www.youtube.com/watch?v=dN0lsF2cvm4)
- [Chroma - Persist and load the vector database](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/chroma.html)


In [None]:
# %%time

# ### download embeddings model
# embeddings = HuggingFaceInstructEmbeddings(
#     model_name = CFG.embeddings_model_repo,
#     model_kwargs = {"device": "cuda"}
# )

# ### create embeddings and DB
# vectordb = FAISS.from_documents(
#     documents = texts,
#     embedding = embeddings
# )

# ### persist vector database
# vectordb.save_local("faiss_index_hp")

# Load vector database

- Once the vector database has been saved, we simply load it from the Kaggle Dataset set in `CFG.Embeddings_path`
- The embeddings function used to load the vector database must be the same one that was used to create the embeddings

In [None]:
%%time

### download embeddings model
embeddings = HuggingFaceInstructEmbeddings(
    model_name = CFG.embeddings_model_repo,
    model_kwargs = {"device": "cuda"}
)

### load vector DB embeddings
vectordb = FAISS.load_local(
    CFG.Embeddings_path,
    embeddings
)

In [None]:
### test if vector DB was loaded correctly
vectordb.similarity_search('magic creatures')

# Prompt Template

- Custom prompt

In [None]:
prompt_template = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


PROMPT = PromptTemplate(
    template = prompt_template,
    input_variables = ["context", "question"]
)

In [None]:
# llm_chain = LLMChain(prompt=PROMPT, llm=llm)
# llm_chain

# Retriever chain

- Retriever to retrieve relevant passages
- Chain to answer questions
- [RetrievalQA: Chain for question-answering](https://python.langchain.com/docs/modules/data_connection/retrievers/)

In [None]:
retriever = vectordb.as_retriever(search_type = "similarity", search_kwargs = {"k": CFG.k})

qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever,
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = True,
    verbose = False
)
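
Conceptually, the `"stuff"` chain type places all retrieved passages into the `{context}` slot of the prompt and sends a single request to the LLM. The sketch below imitates that step with plain string formatting. It is an illustration, not LangChain's implementation: `build_stuff_prompt` is a hypothetical helper, the passages are made up, and the blank-line separator between passages is an assumption.

```python
# Same template as the one defined in the Prompt Template section
prompt_template = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


def build_stuff_prompt(passages, question, separator="\n\n"):
    """Join all passages into one context block and fill in the template."""
    context = separator.join(passages)
    return prompt_template.format(context=context, question=question)


# Made-up passages standing in for the k retrieved chunks
passages = [
    "Hypothetical passage one.",
    "Hypothetical passage two.",
    "Hypothetical passage three.",
]

prompt = build_stuff_prompt(passages, "Which are Hagrid's favorite animals?")
print(prompt)
```

Because everything goes into one prompt, `k` times the chunk size, plus the question and the generated answer, has to fit within the model's maximum length.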

In [None]:
### testing MMR search
question = "Which are Hagrid's favorite animals?"
vectordb.max_marginal_relevance_search(question, k = CFG.k)

In [None]:
### testing similarity search
question = "Which are Hagrid's favorite animals?"
vectordb.similarity_search(question, k = CFG.k)
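
The two searches above can return different passages: similarity search ranks by closeness to the query only, while MMR (maximal marginal relevance) also penalizes candidates that are close to passages already selected. The toy sketch below shows the difference on 2-dimensional vectors. It is an illustration under assumptions: the vectors are made up, cosine similarity is used for simplicity (the actual index may use a different metric), and `lambda_mult=0.5` is an illustrative trade-off value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_similarity(query, docs, k):
    """Indices of the k documents most similar to the query."""
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]


def mmr(query, docs, k, lambda_mult=0.5):
    """Greedy MMR: trade relevance to the query against redundancy with already selected docs."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.10],   # very relevant
    [1.0, 0.12],   # near-duplicate of the first document
    [0.8, -0.60],  # less relevant, but different from the first two
]

print("similarity:", top_k_similarity(query, docs, k=2))
print("mmr:       ", mmr(query, docs, k=2))
```

In this toy setup similarity search returns the two near-duplicates, whereas MMR swaps the second one for the more distinct document.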

# Post-process outputs

- Format llm response
- Cite sources (PDFs)
- Change `width` parameter to format the output

In [None]:
def wrap_text_preserve_newlines(text, width=700):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text


def process_llm_response(llm_response):
    ans = wrap_text_preserve_newlines(llm_response['result'])

    sources_used = ' \n'.join(
        [
            source.metadata['source'].split('/')[-1][:-4] + ' - page: ' + str(source.metadata['page'])
            for source in llm_response['source_documents']
        ]
    )

    ans = ans + '\n\nSources: \n' + sources_used
    return ans
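
To check the source formatting without running the LLM, the same string manipulation can be applied to a mock response. This is a sketch under assumptions: `mock_response` is a made-up dict shaped like the chain output when `return_source_documents = True`, the file paths are hypothetical, and `format_sources` is a hypothetical helper that mirrors the expression used in `process_llm_response`.

```python
from types import SimpleNamespace

# Made-up response: a result string plus documents that carry `source` and `page` metadata
mock_response = {
    "result": "A made-up answer.",
    "source_documents": [
        SimpleNamespace(metadata={"source": "/data/pdfs/book-one.pdf", "page": 12}),
        SimpleNamespace(metadata={"source": "/data/pdfs/book-two.pdf", "page": 7}),
    ],
}


def format_sources(source_documents):
    """Keep the file name without its 4-character '.pdf' suffix and append the page number."""
    return " \n".join(
        doc.metadata["source"].split("/")[-1][:-4] + " - page: " + str(doc.metadata["page"])
        for doc in source_documents
    )


formatted = (
    mock_response["result"]
    + "\n\nSources: \n"
    + format_sources(mock_response["source_documents"])
)
print(formatted)
```

The `[:-4]` slice assumes every source path ends in a 4-character extension such as `.pdf`; other file types would need a more general approach.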

In [None]:
def llm_ans(query):
    start = time.time()
    llm_response = qa_chain(query)
    ans = process_llm_response(llm_response)
    end = time.time()

    time_elapsed = int(round(end - start, 0))
    time_elapsed_str = f'\n\nTime elapsed: {time_elapsed} s'
    return ans + time_elapsed_str

# Ask questions

- Question Answering from multiple documents
- Run QA Chain
- Talk to your data

In [None]:
CFG.model_name

In [None]:
model

In [None]:
query = "Which challenges does Harry face during the Triwizard Tournament?"
print(llm_ans(query))

In [None]:
query = "Is Malfoy an ally of Voldemort?"
print(llm_ans(query))

In [None]:
query = "What are Horcruxes?"
print(llm_ans(query))

In [None]:
query = "Give me 5 examples of cool potions and explain what they do"
print(llm_ans(query))

# Gradio Chat UI

- Create a chat UI with [Gradio](https://www.gradio.app/guides/quickstart)
- [ChatInterface docs](https://www.gradio.app/docs/chatinterface)
- The notebook should be running if you want to use the chat interface
- Screenshot of the chat UI below
- **Gradio has better compatibility with Colab than with Kaggle. If you plan to use the interface, it is preferable to do so in Google Colab**

In [None]:
### necessary for google colab
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
! pip install --upgrade gradio -qq

In [None]:
!pip install gradio

In [None]:
import gradio as gr
print(gr.__version__)

4.15.0


In [None]:
def predict(message, history):
    # output = message # debug mode

    output = str(llm_ans(message)).replace("\n", "<br/>")
    return output

demo = gr.ChatInterface(
    predict,
    title = f' Open-Source LLM ({CFG.model_name}) for Harry Potter Question Answering'
)

demo.queue()
demo.launch()

NameError: name 'CFG' is not defined

# Conclusions

- Prompt engineering
- Bigger models
- Other model families
- Splitting: chunk size, overlap
- Search: similarity, MMR, k
- Pipeline parameters (temperature, top_p, penalty)
- Embeddings function
- LLM parameters (max len)